FOREWORD


In  a letter  dated  24  February  1976,  Professor  J.  Barkley  Rosser,  then 
Acting  Director  of  the  Mathematics  Research  Center  (MRC),  issued  an 
invitation  on  behalf  of  the  MRC  Executive  Committee  to  hold  the  1977 
Army  Numerical  Analysis  and  Computers  Conference  at  the  University  of 
Wisconsin.  He  stated  that  the  facilities  of  the  Wisconsin  Center  had 
been  reserved  for  a full  week  starting  28  March  1977.  The  first  part  of 
that  week  would  be  devoted  to  an  MRC  sponsored  symposium  on  Mathematical 
Software,  and  during  the  rest  of  the  week  they  would  be  happy  to  serve 
as  the  host  of  the  Army  conference.  Holding  these  two  meetings  back-to- 
back  was  deemed  an  excellent  idea,  and  the  invitation  issued  by  Dr.  Rosser 
was  readily  accepted  by  the  members  of  the  Subcommittee  on  Numerical 
Analysis  and  Computers  of  the  Army  Mathematics  Steering  Committee  (AMSC). 
The AMSC sponsors these conferences, and this subcommittee is responsible
for conducting them. [For those interested in the names of the speakers
at the Mathematical Software Symposium, the program of this meeting is
printed on two of the following pages.]

The theme of the 1977 Army Numerical Analysis and Computers Conference
was "The Numerical Solution of Partial Differential Equations". Many of
the papers presented stressed ideas related to this theme. The keynote
address was delivered by Professor Peter Lax of the Courant Institute of
Mathematical Sciences. He spoke to the group on "Survey of Recent Techniques
for Solving Hyperbolic Equations". The second hour address was entitled
"Progress in the Calculation of Two and Three Dimensional Boundary Layers",
and was delivered by Professor Tuncer Cebeci of California State University
at Long Beach. Drs. James Ortega and David A. Fisher gave the two other
invited addresses. Dr. Ortega, ICASE, NASA at Langley Research Center,
spoke on "Solution of Partial Differential Equations on Vector Computers".
Dr. Robert G. Voigt, also of ICASE, was a coauthor of this paper. David
A. Fisher, of the Institute for Defense Analyses at Arlington, Virginia,
chose to speak on the topic "Numeric Computation Facilities of a Common
Programming Language for DOD". In addition to the invited addresses,
there were several contributed papers. All of these talks were of high
caliber, and we are pleased to announce that many of the papers presented
at this meeting appear in these proceedings.

Members of the AMSC would like to thank MRC for serving as host of this
conference, and to cite especially Professors Louis Rall and J. M. Yohe
for the splendid way they handled the many problems associated with the
local arrangements. Their thanks also go to all the speakers. They have
asked that these proceedings be issued so that those who did attend this
conference could have an opportunity to study the manuscripts of the
presented papers, and other scientists not able to attend could still
benefit from the scientific results obtained by the various authors.

SYMPOSIUM ON MATHEMATICAL SOFTWARE

PROGRAM


SUNDAY, MARCH 27, 1977

p.m.
6:00-10:00  Registration and Open House, Blue Lounge, The Wisconsin
            Center, 702 Langdon Street


MONDAY, MARCH 28, 1977

a.m.
8:00   Registration, first floor, The Wisconsin Center
8:45   Welcome, Ben Noble, Director, Mathematics Research Center

SESSION I    Chaired by C. B. Moler, University of New Mexico
9:00   Speaker: G. H. Golub, Stanford University
       Topic: The Block Lanczos Method for Computing Eigenvalues
              (Joint work with R. Underwood, General Electric, San Jose)
10:00  Coffee, Exhibit Gallery
10:30  Speaker: G. W. Stewart, University of Maryland
       Topic: Research, Development, and LINPACK
11:30  Speaker: M. J. D. Powell, Cambridge University
       Topic: A Technique that Gains Speed and Accuracy in the
              Minimax Solution of Overdetermined Linear Equations

p.m.
12:30  Lunch Break

SESSION II   Chaired by J. H. Griesmer, IBM Research, Yorktown Heights
2:15   Speaker: G. E. Collins, University of Wisconsin-Madison
       Topic: Infallible Calculation of Polynomial Zeros to Specified
              Precision
3:15   Coffee, Exhibit Gallery
3:30   Speaker: R. E. Barnhill, University of Utah
       Topic: Representation and Approximation of Surfaces
4:30   End of Session


TUESDAY, MARCH 29, 1977

a.m.
SESSION III  Chaired by W. J. Cody, Argonne National Laboratory
9:00   Speaker: C. W. Gear, University of Illinois
       Topic: Simulation: Conflicts Between Real Time and Software
10:00  Coffee, Exhibit Gallery
10:30  Speaker: D. C. Hoaglin, Harvard University
       Topic: Mathematical Software and Exploratory Data Analysis
11:30  Speaker: C. L. Lawson, Jet Propulsion Laboratory
       Topic: Software for C^1 Surface Interpolation

p.m.
12:30  Lunch Break

SESSION IV   Chaired by W. Kahan, University of California, Berkeley
2:15   Speaker: W. R. Cowell, Argonne National Laboratory /
                L. D. Fosdick, University of Colorado
       Topic: Mathematical Software Production
3:15   Coffee, Exhibit Gallery
3:30   Speaker: W. S. Brown, Bell Laboratories
       Topic: Portability
4:30   End of Session
6:30   Cocktails (cash bar), Alumni Lounge, The Wisconsin Center,
       702 Langdon Street
7:30   Dinner, The Wisconsin Center Dining Room

(This program is completed on the next page)

WEDNESDAY, MARCH 30, 1977

a.m.
SESSION V    Chaired by T. E. Hull, University of Toronto
9:00   Speaker: I. Babuska, University of Maryland
       Topic: Computational Aspects of the Finite Element Method
              (Joint work with W. Rheinboldt, University of Maryland)
10:00  Coffee, Exhibit Gallery
10:30  Speaker: L. F. Shampine, Sandia Laboratories
       Topic: The Art of Writing a Runge-Kutta Code
11:30  Speaker: A. Brandt, Weizmann Institute
       Topic: Multi-Level Adaptive Techniques for Partial Differential
              Equations: Ideas and Software
12:30  End of Program

PROGRAM COMMITTEE

J. R. Rice, Chairman
C. de Boor
J. M. Yohe

Gladys G. Moran, Secretary

1977 ARMY NUMERICAL ANALYSIS AND COMPUTERS CONFERENCE
Mathematics  Research  Center 
University  of  Wisconsin 
Madison,  Wisconsin 


All sessions will be held in the Wisconsin Center, Lake and Langdon Streets,
Madison, Wisconsin


Wednesday  Afternoon,  30  March  1977 

1300-1345  REGISTRATION  - EXHIBIT  GALLERY 

1345-1400  OPENING  REMARKS  - AUDITORIUM 

1400-1500  KEYNOTE  ADDRESS 

CHAIRPERSON - Hermann R. Robl, U. S. Army Research Office,
Research Triangle Park, North Carolina

SPEAKER - Peter Lax, Courant Institute of Mathematical
Sciences, New York, New York

TITLE - SURVEY OF RECENT TECHNIQUES FOR SOLVING HYPERBOLIC
EQUATIONS

1500-1515  BREAK  - Coffee  in  the  Exhibit  Gallery 

1515-1715  TECHNICAL  SESSION  I - AUDITORIUM 

CHAIRPERSON  - Norman  Banks,  Ballistic  Research  Laboratories, 

Aberdeen  Proving  Ground,  Maryland 

A FINITE ELEMENT METHOD FOR SYSTEMS OF TIME DEPENDENT
HYPERBOLIC EQUATIONS IN TWO SPATIAL DIMENSIONS APPLIED
TO UNSTEADY GAS FLOW

James A. Schmitt, Ballistic Research Laboratories,
Aberdeen Proving Ground, Maryland

THE NUMERICAL SOLUTION OF POROUS FLOW FREE BOUNDARY
PROBLEMS

Colin W. Cryer, Mathematics Research Center, University
of Wisconsin, Madison, Wisconsin

LOCALLY  ONE  DIMENSIONAL  METHODS  FOR  FREE  BOUNDARY 
PROBLEMS 

Gunter  H.  Meyer,  Georgia  Institute  of  Technology, 

Atlanta,  Georgia 

ON  THE  NUMERICAL  SOLUTION  OF  POISSON’S  EQUATION  BY  THE 
CAPACITANCE  MATRIX  METHOD 

Arthur  Shieh,  Mathematics  Research  Center,  University 
of  Wisconsin,  Madison,  Wisconsin 

A POWER SERIES SOLUTION OF A HARMONIC MIXED BOUNDARY
VALUE PROBLEM

J. Barkley Rosser and N. Papamichael, Mathematics
Research Center, University of Wisconsin, Madison,
Wisconsin

A FOURIER-SOLUTION OF PARABOLIC PDE BY TAYLOR SERIES
Y. F. Chang, University of Nebraska, Lincoln, Nebraska

A "SINC-GALERKIN" METHOD OF SOLUTION OF BOUNDARY VALUE
PROBLEMS

Frank Stenger, University of British Columbia, Vancouver,
B.C., Canada

1515-1715  TECHNICAL SESSION II - ROOM 210

CHAIRPERSON - Bruce Barnett, ARRADCOM, Dover,
New Jersey

A COOPERATIVE  EFFORT  FOR  THE  STUDY  OF  NUMERICAL  METHODS 
FOR  ELLIPTIC  PARTIAL  DIFFERENTIAL  EQUATIONS-ELLPACK 
John  R.  Rice,  Purdue  University,  West  Lafayette,  Indiana 

STORAGE AND RETRIEVAL OF SYSTEMS OF PDES, THEIR SOLUTIONS,
AND IDENTIFYING INFORMATION

Morton A. Hirschberg and Joseph Lacetera, Jr., Ballistic
Research Laboratories, Aberdeen Proving Ground, Maryland

A PARALLEL ARRAY COMPUTER FOR THE SOLUTION OF FIELD
PROBLEMS

W. R. Cyre, C. J. Davis, A. A. Frank, L. Jedynak,
M. J. Redmond, and V. C. Rideout*, University of
Wisconsin, Madison, Wisconsin

SOFTWARE FOR INTERVAL ARITHMETIC: A REASONABLY PORTABLE
PACKAGE

J. M. Yohe, Mathematics Research Center, University of
Wisconsin, Madison, Wisconsin

AUTOMATIC DIFFERENTIATION OF COMPUTER PROGRAMS

G. Kedem, Mathematics Research Center, University of
Wisconsin, Madison, Wisconsin

ON BOUNDARY EXTRAPOLATION AND DISSIPATIVE SCHEMES FOR
HYPERBOLIC PROBLEMS

Moshe Goldberg, University of California, Los Angeles,
California

PANEL DISCUSSION - LARGE SCALE MATHEMATICAL SOFTWARE
IN ARMY LABORATORIES - ROOM 316

PANEL MODERATOR - Paul T. Boggs, U. S. Army Research
Office, Research Triangle Park, North Carolina

PANEL MEMBERS - J. Michael Yohe, Mathematics Research
Center, University of Wisconsin, Madison, Wisconsin;
John Rice, Purdue University, West Lafayette, Indiana;
John Kring, Air Mobility R&D Laboratory, Cleveland,
Ohio; John Shipley, Air Mobility R&D Laboratory, NASA-
Langley Research Center, Hampton, Virginia; Dennis
Tracey, U. S. Army Materials & Mechanics Research Center,
Watertown, Massachusetts; M. A. Hussain, Maggs Research
Center, Watervliet Arsenal, Watervliet, New York;
Norman Banks, Ballistic Research Laboratories, Aberdeen
Proving Ground, Maryland

Thursday, 31 March 1977

GENERAL  SESSION  I - AUDITORIUM 

CHAIRPERSON  - Robert  E.  Singleton,  U.  S.  Army  Research 
Office,  Research  Triangle  Park,  North  Carolina 

SPEAKER  - Tuncer  Cebeci,  California  State  University, 

Long  Beach,  California 

TITLE  - PROGRESS  IN  THE  CALCULATION  OF  TWO  AND  THREE 
DIMENSIONAL  BOUNDARY  LAYERS 

0930-1030  TECHNICAL SESSION III - AUDITORIUM

CHAIRPERSON - Dennis Tracey, U. S. Army Materials &
Mechanics Research Center, Watertown, Massachusetts

APPROXIMATION WITH VFB-SPLINES

Royce W. Soanes, Jr., Watervliet Arsenal, Watervliet,
New York

COMPUTATION OF L~ SPLINE APPROXIMANTS WITH APPLICATIONS
Charles K. Chui, Philip W. Smith*, Jeff Chow, Texas
A&M University, College Station, Texas

AN ADAPTIVE ROUTINE FOR NUMERICAL QUADRATURE

Arthur Hausner, Harry Diamond Laboratories, Adelphi,
Maryland

0930-1030  TECHNICAL SESSION IV - ROOM 227

CHAIRPERSON - Edward Ross, U. S. Army Natick R&D Command,
Natick, Massachusetts

ON THE OPTIMALITY OF THE RAYLEIGH-RITZ APPROXIMATION
S. C. Eisenstat, R. Schreiber*, M. H. Schultz, Yale
University, New Haven, Connecticut

THE APPROXIMATE SOLUTION OF GENERALIZED HAMILTONIAN
PROBLEMS

B. Noble* and J. V /ago, Mathematics Research Center,
University of Wisconsin, Madison, Wisconsin

GALERKIN METHODS FOR ROTATIONALLY SYMMETRIC PARTIAL
DIFFERENTIAL EQUATIONS

Dennis C. Jesperson, Mathematics Research Center,
University of Wisconsin, Madison, Wisconsin

0930-1030  TECHNICAL SESSION V - ROOM 313

CHAIRPERSON - Sylvan Eisman, Frankford Arsenal,
Philadelphia, Pennsylvania

NUMERICAL SOLUTION OF GUN TUBE PROBLEMS IN THE ELASTIC-
PLASTIC RANGE

Peter C. T. Chen, Watervliet Arsenal, Watervliet, New
York

REMARKS ON NUMERICAL ANALYSIS OF RADON INTEGRAL EQUATION
AND PICTURE RECONSTRUCTION FROM PROJECTIONS

M. Z. Nashed, University of Michigan, Ann Arbor,
Michigan

PERTURBATION METHODS FOR THE SOLUTION OF LINEAR PROBLEMS
L. B. Rall, Mathematics Research Center, University
of Wisconsin, Madison, Wisconsin

1030-1045  BREAK - Coffee in the Exhibit Gallery

1045-1145  GENERAL SESSION II - AUDITORIUM

CHAIRPERSON - M. A. Hussain, Watervliet Arsenal, Watervliet,
New York

SPEAKER - Dr. James M. Ortega, ICASE, NASA Langley Research
Center, Hampton, Virginia

TITLE - SOLUTION OF PARTIAL DIFFERENTIAL EQUATIONS ON
VECTOR COMPUTERS (Dr. Robert G. Voigt, Co-author)

1145-1300  LUNCH

1300-1400  TECHNICAL SESSION VI - AUDITORIUM

CHAIRPERSON - Shirley J. Smith, Aviation Systems Command,
St. Louis, Missouri

A MATHEMATICAL SOLUTION TO THE BALLISTIC DIFFERENTIAL
EQUATION PROGRAMMED FOR THE TEXAS INSTRUMENT SR-SB
T. H. Slook, Temple University and Frankford Arsenal,
Philadelphia, Pennsylvania

APPROXIMATE DECOMPOSITIONS OF SPARSE MATRICES

H. A. van der Vorst, European Research Office, London,
and Academic Computer Center Utrecht, Netherlands

A GLOBAL ELEMENT METHOD FOR ELLIPTIC P.D.E.'s
L. M. Delves, University of Victoria, Victoria, British
Columbia, Canada

1300-1400  TECHNICAL SESSION VII - ROOM 227

CHAIRPERSON - Peter C. T. Chen, Watervliet Arsenal,
Watervliet, New York

COMMENTS ON THE SOLUTION OF COUPLED STIFF DIFFERENTIAL
EQUATIONS

M. D. Kregel and J. M. Heimerl, Ballistic Research
Laboratories, Aberdeen Proving Ground, Maryland

A-STABILITY OF BROWN'S MULTISTEP MULTIDERIVATIVE METHODS

1400-1415

1415-1515

1515-1615

ABSTRACT. The report contains a discussion of a numerical method for
solving systems of first order time dependent hyperbolic equations in two
spatial variables. This scheme, which combines the finite element
methodology and the properties of a hyperbolic system of differential
equations, is applied to unsteady gas flow problems. The formulation is
based on the elementwise least squares minimization of the differential
residual error and on the construction of the finite elements in both space
and time. The methodology is presented in detail. Numerical experiments
involving both smooth and shocked flows are discussed. Areas of possible
future code development are proposed.

1. INTRODUCTION. Although the finite element method is a proven,
effective method for obtaining numerical solutions of solid mechanics
problems [1]¹, its impact on computational fluid dynamics has been felt only
in the past few years [2]. Because of the diverse applications of this method
in continuum mechanics, many departures from the original method used in
structural analysis have been made. Our adaptation of the finite element
method for direct application to unsteady gas flows in two spatial variables
is based, in part, on Lynn and Arya's least squares formulation [3,4] and on
Polk's one dimensional study [5]. Lynn and Arya's approach is based on the
elementwise least squares minimization of the differential residual error,
which allows a direct finite element formulation from the governing
differential equations. Furthermore, since the governing equations are
hyperbolic, the finite elements can be constructed in both space and time so
that they approximate the domain of determinacy associated with hyperbolic
problems. Polk combined these two concepts and applied them to the unsteady
isentropic flow of an inviscid gas expanding behind a piston. Using both
linear and quadratic approximations to the dependent variables, he showed good
agreement between the finite element results and the exact solution for the
smooth portion of the flow. Near the gradient discontinuities which occurred
in the flow, Polk constructed the finite elements so that a side of the
element and the locus of discontinuities coincided. Using the special
construction, good agreement was obtained everywhere in the flow.

This report is concerned with a finite element method for unsteady
inviscid compressible flows in two spatial dimensions, a corresponding pilot
computer code and some resulting numerical experiments. The governing
equations form a system of nonlinear hyperbolic equations. We take advantage
of this fact to simplify the general finite element method. This is
contrary to Wellford and Oden's [6] parabolic regularization method for
hyperbolic problems. In Wellford and Oden's technique, certain terms which
depend on the discretization parameters are appended to the equations so that
they become parabolic. This parabolic problem is then solved by a finite
element technique. It can be shown for a class of problems that the solution
of the parabolic problem converges to the original hyperbolic solution in the
limit as the mesh size tends to zero. On the other hand, our formulation
deals directly with the hyperbolic equations.

¹ Numbers in brackets designate references at end of report.

Our construction of the finite elements is reminiscent of Polk's
construction in that they are in both space and time, but differs in that
they enclose, not coincide with, the domain of dependence. It will be shown
that the construction simplifies the necessary integration routines while
satisfying a Courant condition. No special construction of the finite element
is provided near steep gradients and discontinuities in the flow.
Consequently, no special knowledge of the solution is required and all
interior nodes in the calculation are treated identically. Furthermore, by
applying Lynn and Arya's least squares minimization to each finite element, we
can avoid the large matrices generally associated with the finite element
method for elliptic problems while still retaining the essential advantages of
the method.

The general methodology (the construction of the finite elements, the
approximations to the flow variables and the formulation in terms of the least
squares error criterion) is explained in section 2. Section 3 contains a brief
discussion of the form of the governing equations and certain approximations
used within the code. The results of two numerical experiments are given and
discussed in section 4. Section 5 contains a summary of the method and areas
in which future work is required.

2. GENERAL METHODOLOGY. The first step in the finite element
methodology is to divide the solution region into elements. We divide the
computational domain for a given time (a two dimensional region) into
triangular elements. The vertices of the triangles are called nodes. We assume
that the boundaries are stationary so that the triangular divisions remain
unchanged with time. The system of governing equations for compressible fluid
flow includes coupled nonlinear partial differential equations which express
conservation of mass, momentum and energy, plus an algebraic equation of
state. Because the governing equations are hyperbolic, the solution at a point
$(\bar{x}, \bar{y}, \bar{t} + \Delta t)$ in the solution domain depends only
on the values of the flow variables within the intersection of the domain of
dependence (the Mach cone) from the point
$(\bar{x}, \bar{y}, \bar{t} + \Delta t)$ and the $(x, y, \bar{t})$ plane.
Consequently, given a union of triangles in the $(x, y, \bar{t})$ plane with a
vertex at the point $(\bar{x}, \bar{y}, \bar{t})$ and the values of the flow
variables at the corresponding nodes, we can compute a value of $\Delta t$
such that the Mach cone from the point
$(\bar{x}, \bar{y}, \bar{t} + \Delta t)$ lies within the union of triangles.
Since the computed value of $\Delta t$ will vary from node to node, we take
the minimum value of $\Delta t$ over all nodes as the next time step.
This value allows a systematic advance in time.
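
The time step selection described above can be sketched as follows. This is only an illustration under stated assumptions, not the pilot code: the base of the Mach cone is bounded by a disk of radius (|velocity| + sound speed) Δt centered at the node, and that disk is required to fit inside the union of surrounding triangles, whose outer boundary is the set of edges opposite the node. The function names and the data layout are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (2-D tuples)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(px - ax, py - ay)
    # Parameter of the orthogonal projection, clamped to the segment.
    s = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    s = max(0.0, min(1.0, s))
    return math.hypot(px - (ax + s * dx), py - (ay + s * dy))

def nodal_time_step(node, opposite_edges, u, v, c):
    """Largest dt for which the bounding disk of the Mach cone base stays
    inside the union of triangles around `node` (assumed bound, see text)."""
    d_min = min(point_segment_distance(node, a, b) for a, b in opposite_edges)
    return d_min / (math.hypot(u, v) + c)

def global_time_step(nodal_data):
    """Minimum of the nodal time steps over all nodes."""
    return min(nodal_time_step(*data) for data in nodal_data)

# Two nodes, each surrounded by a square ring of opposite edges.
ring1 = [((1, -1), (1, 1)), ((1, 1), (-1, 1)),
         ((-1, 1), (-1, -1)), ((-1, -1), (1, -1))]
ring2 = [((0.5, -0.5), (0.5, 0.5)), ((0.5, 0.5), (-0.5, 0.5)),
         ((-0.5, 0.5), (-0.5, -0.5)), ((-0.5, -0.5), (0.5, -0.5))]
dt = global_time_step([((0.0, 0.0), ring1, 3.0, 4.0, 5.0),
                       ((0.0, 0.0), ring2, 0.0, 0.0, 1.0)])
```

Bounding the cone base by a disk centered at the node is more restrictive than the exact containment test, so it errs on the side of a smaller time step.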

We now define our finite element at a point
$(\bar{x}, \bar{y}, \bar{t} + \Delta t)$ as the union of all prisms with a
base vertex at the point $(\bar{x}, \bar{y}, \bar{t})$ and with a uniform
height $\Delta t$. See Figure 1. This construction offers several desirable
simplifications over other possible constructions of the finite element.
From the discussion above, it is clear that the values of the flow variables
at a node are independent of the values at the other nodes at the same time
level. Thus, the interconnection of the nodal values which is characteristic
of finite element formulations of elliptic problems, and which results in the
manipulation of large matrices, can and will be avoided. We will solve for the
central nodal values at the new time level by considering only the finite
element at the central node. Furthermore, any necessary time integrations over
the finite element are simplified, since the nodes of the elements are
independent of time.

The next step is to choose the interpolating or trial functions over the
finite element. Let $\omega$ be a flow variable; that is, a dependent variable
computed directly from the governing differential equations, not the equation
of state. The interpolating function for $\omega$ is

$$\omega(x,y,t) = \omega_0(x,y,\bar{t}) + (t - \bar{t})\,(a_i x + a_{i+1} y + a_{i+2}), \tag{1}$$

where the points $(x, y, t)$ lie within the finite element at
$(\bar{x}, \bar{y}, \bar{t} + \Delta t)$ and the parameters $a_i$, $a_{i+1}$,
$a_{i+2}$ are to be determined. The function $\omega_0(x, y, \bar{t})$
represents the flow variable $\omega$ at time $\bar{t}$ and is assumed known.
These interpolation functions form over each finite element a linear
approximation in space to the time derivative of $\omega$.

To complete the model, a technique to determine the parameters $a_i$, and thus
the flow variables, is needed. A basic idea in the finite element methodology
is to minimize the errors occurring from the residual of the governing
differential equations in terms of the interpolating functions. Thus, the
finite element method approximates the minimum residual whereas the finite
difference method approximates the differential equations. Since the proposed
analysis is elementwise, we desire a minimization technique for each element.
To this end, we choose the elementwise least squares minimization of the
differential residual error employed by Lynn and Arya.

We substitute the interpolating functions into the k-th governing
differential equation, make the result dimensionless and denote the resulting
residual by $D_k(x, y, t; a)$, where $a$ is the vector of unknown parameters
$a_i$. Each residual is dimensionless so that no individual residual will
numerically dominate the least squares sum of all the residuals because of
dimensional disparities. The $D_k$'s are explicit algebraic functions of the
independent variables $x$, $y$, $t$ and the parameters $a$. The total error in
the sense of least squares over a particular finite element is denoted by
$E(a)$ and is given by

$$E(a) = \iiint_V \sum_{k=1}^{\mathrm{NOEQ}} D_k^2(x,y,t;a)\,dx\,dy\,dt, \tag{2}$$

where NOEQ is the number of governing equations and $V$ is the volume of the
finite element. We wish to minimize $E(a)$ with respect to the $a_i$'s. A
necessary condition for the existence of a minimum is
$\partial E/\partial a_i = 0$, or

$$\iiint_V \sum_{k=1}^{\mathrm{NOEQ}} D_k \frac{\partial D_k}{\partial a_i}\,dx\,dy\,dt = 0 \quad \text{for each } i. \tag{3}$$

Note that the derivatives $\partial D_k/\partial a_i$ can be explicitly
calculated. By solving this nonlinear algebraic system of equations for the
unknowns $a_i$, we can determine the values of the flow variables at the nodal
point $(\bar{x}, \bar{y}, \bar{t} + \Delta t)$. We repeat this process for
each interior node in the solution domain.


For a boundary node, the above procedure is slightly altered. We rewrite
the given boundary condition(s) at the boundary node in terms of the
interpolating functions (1). We then solve for the unknown parameters $a_i$ in
equation (3) at the boundary node subject to the rewritten boundary conditions.
Thus, at a boundary node we no longer have a pure minimization problem as at an
interior node, but rather a constrained minimization problem.
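
As an illustration of the elementwise least squares idea for an interior node, the sketch below applies it to a much simpler model problem than the gas dynamics system: the one dimensional linear advection equation w_t + c w_x = 0 on an element [-h, h] x [0, dt], with the trial function w = w0(x) + t (a1 x + a2) and linear initial data w0(x) = s x + b. The model equation, the function names and the parameter names are assumptions made for this example only. Because the residual is linear in a1 and a2 here, the condition dE/da_i = 0 reduces to a 2 x 2 linear system rather than the nonlinear system of the report.

```python
import math

def gauss2(lo, hi):
    """Two-point Gauss-Legendre (point, weight) pairs on [lo, hi]."""
    mid, half = 0.5 * (lo + hi), 0.5 * (hi - lo)
    r = half / math.sqrt(3.0)
    return [(mid - r, half), (mid + r, half)]

def least_squares_parameters(c, s, h, dt):
    """Minimize the integral of D^2 over the element, where
    D(x, t; a) = a1*(x + c*t) + a2 + c*s is the advection residual."""
    m11 = m12 = m22 = r1 = r2 = 0.0
    for x, wx in gauss2(-h, h):
        for t, wt in gauss2(0.0, dt):
            w = wx * wt
            d1 = x + c * t      # dD/da1
            d2 = 1.0            # dD/da2
            d0 = c * s          # part of D that does not depend on a
            m11 += w * d1 * d1
            m12 += w * d1 * d2
            m22 += w * d2 * d2
            r1 -= w * d1 * d0
            r2 -= w * d2 * d0
    # Solve the 2 x 2 normal equations by Cramer's rule.
    det = m11 * m22 - m12 * m12
    a1 = (r1 * m22 - m12 * r2) / det
    a2 = (m11 * r2 - m12 * r1) / det
    return a1, a2

a1, a2 = least_squares_parameters(c=2.0, s=3.0, h=0.5, dt=0.1)
```

For this data the exact solution is w = s (x - c t) + b, whose time derivative is the constant -c s, so the minimization should return a1 close to 0 and a2 close to -c s.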

3. PILOT COMPUTER PROGRAM. The pilot computer code (see Schmitt [7] for
the program listing) is written in a modular structured fashion in order to
clarify the logic of the program and to allow changes in the form of the
governing equations, the interpolating functions, the method selected to solve
the nonlinear system, etc. Such flexibility is highly desirable for this pilot
code. For example, in finite difference techniques, the formulation of the
equations has a profound effect on a method's performance (see, for example,
Moretti [8]). Similarly, the form of the equations may affect the performance
of the finite element method.


The code has three major components. The first component accepts the
geometric and control parameters. Furthermore, this section accepts and/or
generates the nodal positions within the x-y solution subdomain and the
necessary initial and boundary value data. The second component calculates the
time increment and the values of the flow variables at each node at the new
time. The "heart" of the pilot code is clearly the second component, since the
general method outlined in section 2 is implemented in this portion of the
code. The third component provides the output and graphics capabilities for
the program. The code uses the nonconservation inviscid form of the governing
equations in Cartesian coordinates (where body forces, heat absorption and
heat fluxes are neglected), the Newton-Raphson iteration method and certain
approximations. We briefly discuss this specific situation below.
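
The governing equations in dimensional variables are:

$$\frac{\partial \rho}{\partial t} + u\frac{\partial \rho}{\partial x} + v\frac{\partial \rho}{\partial y} + \rho\left(\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y}\right) = 0, \tag{4}$$

$$\rho\left(\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y}\right) + \frac{\partial p}{\partial x} = 0, \tag{5}$$

$$\rho\left(\frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y}\right) + \frac{\partial p}{\partial y} = 0, \tag{6}$$

$$\rho\left(\frac{\partial e}{\partial t} + u\frac{\partial e}{\partial x} + v\frac{\partial e}{\partial y}\right) + p\left(\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y}\right) = 0, \tag{7}$$

$$p = p(\rho, e). \tag{8}$$

Here the spatial coordinates are $x$ and $y$, $t$ is the time, $u$ is the
x-component of the velocity, $v$ is the y-component of the velocity, $p$ is
the pressure, $\rho$ is the density and $e$ is the internal energy per unit
mass. The functional form of the equation of state is given by equation (8).
The particular equation of state used in a given calculation is specified in
one of the program's subroutines.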

In the actual calculations for a given finite element at the point
$(\bar{x}, \bar{y}, \bar{t} + \Delta t)$ we translate the origin of the
coordinate system to the point $(\bar{x}, \bar{y}, \bar{t})$. Conceptually,
the translation enables each finite element to have its base centered at the
same point $(x, y, t) = (0, 0, 0)$ and practically, it simplifies the
calculations. The interpolating functions for the variables $u$, $v$, $\rho$,
$e$ within a finite element are:

$$u(x,y,t) = u_0(x,y,0) + t\,(a_1 x + a_2 y + a_3), \tag{9}$$

$$v(x,y,t) = v_0(x,y,0) + t\,(a_4 x + a_5 y + a_6), \tag{10}$$

$$\rho(x,y,t) = \rho_0(x,y,0) + t\,(a_7 x + a_8 y + a_9), \tag{11}$$

$$e(x,y,t) = e_0(x,y,0) + t\,(a_{10} x + a_{11} y + a_{12}), \tag{12}$$

where the variables $x$, $y$, $t$ are in the translated finite element, the
variables with subscript zero denote the variables at the known time level
$t = 0$, and the $a_i$'s, $i = 1, 2, \ldots, 12$, are the unknown parameters.
We define the dimensionless residuals $D_k(x, y, t; a)$, $k = 1, 2, 3, 4$, as
the quantities obtained by substituting equations (8) - (12) into the four
differential equations (4) - (7), respectively, and then by dividing the first
result by a ratio of a characteristic density to $\Delta t$, the second and
third results by a ratio of a characteristic velocity to $\Delta t$ and the
fourth by a ratio of a characteristic internal energy per mass to $\Delta t$.




The values of the characteristic quantities are the density, sound speed and
internal energy per unit mass at the center node at $t = 0$. The derivatives
of $p$ with respect to the variables $x$, $y$ and $t$ can be obtained by the
chain rule. The subsequent partial derivatives of $p$ with respect to $\rho$
and $e$ are calculated directly from the equation of state and are coded into
the subroutine associated with the equation of state. The functions $u_0$,
$v_0$, $\rho_0$, $e_0$ are estimated by a linear approximation on each triangle
based on their values at the vertices. Although other approximations are
possible, the present one is easily computable and is independent of the
number of prisms in the finite element. The partial derivatives of $u_0$,
$v_0$, $\rho_0$, $e_0$ required in $D_k(x, y, t; a)$ vary from triangle to
triangle, and consequently equation (3) must be rewritten as

$$\sum_{l=1}^{\mathrm{NUMTRI}} \iiint_{\text{prism } l} \sum_{k=1}^{\mathrm{NOEQ}} D_k \frac{\partial D_k}{\partial a_i}\,dx\,dy\,dt = 0, \quad i = 1, 2, \ldots, 12, \tag{13}$$

where NOEQ is now four and the finite element contains NUMTRI prisms (or
triangles).


The Newton-Raphson method is used to solve the system of nonlinear
algebraic equations (13). Since the terms $D_k(x, y, t; a)$ are known functions
of the parameter a, the second partial derivatives of the least squares error
can be explicitly calculated and are given by
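As a minimal sketch of how a Newton-Raphson iteration of this kind works, the following Python fragment minimizes a one-parameter least squares error. The residual $D_k(a) = \exp(a x_k) - y_k$, the data, and the function name are hypothetical stand-ins chosen only for illustration; they are not the flow residuals of the paper. The point shown is that the derivative of the gradient contains both the product of first derivatives and the residual times the second derivative.

```python
import math

def newton_least_squares(x, y, a0, tol=1e-12, max_iter=50):
    """Minimize E(a) = sum_k D_k(a)^2 with the toy residual D_k(a) = exp(a*x_k) - y_k."""
    a = a0
    for _ in range(max_iter):
        grad = 0.0  # F(a) = sum_k D_k * dD_k/da (half of dE/da)
        hess = 0.0  # dF/da = sum_k [(dD_k/da)^2 + D_k * d2D_k/da2]
        for xk, yk in zip(x, y):
            ek = math.exp(a * xk)
            d = ek - yk          # residual
            d1 = xk * ek         # first derivative of the residual
            d2 = xk * xk * ek    # second derivative of the residual
            grad += d * d1
            hess += d1 * d1 + d * d2
        step = grad / hess
        a -= step
        if abs(step) < tol:
            break
    return a

# Hypothetical data generated with a = 0.5, so the minimum has zero residual.
x_data = [0.0, 1.0, 2.0]
y_data = [math.exp(0.5 * xk) for xk in x_data]
a_fit = newton_least_squares(x_data, y_data, a0=0.3)
```

In the twelve-parameter case the scalar division becomes the solution of a 12 x 12 linear system at each iteration, but the structure of the update is the same.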


The initial values for $a_3$, $a_6$, $a_9$, $a_{12}$ in the Newton-Raphson scheme for the
finite element at $(0, 0, \Delta t)$ are taken as the previously computed values of
these parameters for the finite element at (0, 0, 0). However, at the first
time step a special calculation is needed to find the appropriate initial
values. These parameters are determined by solving for $\partial/\partial t$ in each of the
equations (4) - (7), by approximating the spatial dependence of the variables
at the initial time by linear functions on each triangle, by finding the
corresponding values of $\partial/\partial t$ at each vertex, by equating these to the time
partials of equations (9) - (12), by averaging the resulting values of the $a_i$'s
over the triangles and by translating the result. In all cases, the initial
values of $a_1$, $a_2$, $a_4$, $a_5$, $a_7$, $a_8$, $a_{10}$ and $a_{11}$ are set to zero. In the sample
problems, the iterations converged faster with the zero values than with the
more obvious choice $a_i \neq 0$, $i \neq 3, 6, 9, 12$. The integrals of
$D_k \cdot (\partial^2 D_k/\partial a_i \partial a_j)$, $D_k \cdot (\partial D_k/\partial a_i)$ and
$(\partial D_k/\partial a_i) \cdot (\partial D_k/\partial a_j)$ are approximated by a two point
Gaussian quadrature in time and by a product of first order functions in space.
For example,


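\[ \iiint_{\text{prism } l} D_k \frac{\partial D_k}{\partial a_j}\; dx\,dy\,dt \approx \frac{\Delta t}{2} \sum_{m=1}^{2} \left[ \iint_{\text{triangle } l} D_k(x,y,t_m;a) \times \frac{\partial D_k}{\partial a_j}(x,y,t_m;a)\; dx\,dy \right], \tag{15} \]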


where $t_m$, m = 1, 2, are the Gaussian quadrature points between zero and $\Delta t$.
In the spatial integrals at each Gaussian time, both factors are approximated
by first order functions in x and y which are then multiplied and integrated
exactly over the desired triangle. Once the values of the parameters $a_i$ are
known, the desired values of the flow variables at the center node at the new
time ($t = \Delta t$) can be determined easily from equations (9) - (12).
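The quadrature just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' hydrocode: the function names are hypothetical, each factor is represented on a triangle by its three vertex values, and the exact integral of a product of two linear functions uses the standard linear-element identity $\iint f g \, dA = (A/12)\,(\sum_i f_i \sum_i g_i + \sum_i f_i g_i)$.

```python
import math

def gauss_times(dt):
    """Two-point Gauss-Legendre nodes and weights mapped to the interval [0, dt]."""
    offset = 0.5 * dt / math.sqrt(3.0)
    return [(0.5 * dt - offset, 0.5 * dt), (0.5 * dt + offset, 0.5 * dt)]

def triangle_area(verts):
    (x1, y1), (x2, y2), (x3, y3) = verts
    return 0.5 * abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))

def integrate_linear_product(verts, f_vals, g_vals):
    """Exact integral over a triangle of the product of two first order
    functions, each given by its values at the three vertices."""
    area = triangle_area(verts)
    cross = sum(f * g for f, g in zip(f_vals, g_vals))
    return area / 12.0 * (sum(f_vals) * sum(g_vals) + cross)

def integrate_prism(verts, f, g, dt):
    """Approximate the integral of f*g over the prism triangle x [0, dt]."""
    total = 0.0
    for t_m, w_m in gauss_times(dt):
        f_vals = [f(x, y, t_m) for x, y in verts]
        g_vals = [g(x, y, t_m) for x, y in verts]
        total += w_m * integrate_linear_product(verts, f_vals, g_vals)
    return total

# Hypothetical check on the unit right triangle with dt = 2.
unit_triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
spatial = integrate_linear_product(unit_triangle, [1.0, 2.0, 1.0], [1.0, 1.0, 2.0])
prism = integrate_prism(unit_triangle,
                        lambda x, y, t: 1.0 + x + t,
                        lambda x, y, t: 1.0 + y,
                        2.0)
```

Because the test integrand is linear in t and bilinear in space, both the time quadrature and the spatial formula are exact for it, which makes the sketch easy to verify by hand.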

The  above  discussion  applies  equally  to  interior  and  boundary  type  nodes. 
However,  as  noted  at  the  end  of  section  2,  the  minimization  at  a boundary  node 
is  a constrained  minimization.  The  type  of  constraints  will  depend  on  the 
type  of  boundary  condition  imposed  at  a given  node.  As  an  example,  the  zero 
normal velocity boundary condition imposed at a solid wall is incorporated
into the equation system (13) in Appendix A.

4. NUMERICAL EXPERIMENTS. In this section the results of two non-steady flow
calculations, the flow behind a cylindrical blast wave and the flow across a
propagating normal shock, are presented. These two examples are computed in
order to ascertain the scheme's characteristics on interior nodes for a smooth and
discontinuous flow, respectively. Both of these time-dependent flows are
essentially one dimensional in space; however, they will be treated as two
dimensional problems. Furthermore, since closed-form solutions are known for
each problem, the accuracy of the finite element formulation can be precisely
evaluated.

The similarity solution of the blast wave problem is discussed by
Sedov [9]. The flow behind a cylindrical blast wave rather than a planar wave
is computed because of its associated circular solution domain. The
computational domain for a given time consists of an annulus from radius $r_a$ to radius
$r_b$, which is the position of the shock front at the initial time $t_0$. Perhaps the
most important advantage of the finite element method is its capability of
dealing with complex geometrical shapes by using arbitrarily shaped simple
elements. The ease with which this cylindrical problem is handled by the
Cartesian finite element program, especially the circular boundaries,
demonstrates this advantage in an elementary manner. In Figure 2, the first
quadrant of a computational domain for a given time is drawn, where $r_a$ = 2.2 [m],
$r_b$ = 3.0 [m], the nodes are equally spaced at four degree intervals on a given
circle and the radial divisions are computed so that approximate equilateral
triangles result. Consequently, the size of the triangles increases with the
radial distance from the origin. Note the good approximation to the arc
boundaries by the series of straight lines forming the base of the triangles.

A finite difference method in Cartesian coordinates either would approximate
the arc boundaries by a series of horizontal and vertical lines or would
require a transform of the solution domain. In both cases special treatment of
the boundaries would be required. On the other hand, no special programming
is needed to treat the arc in the finite element code.
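One way to obtain radial divisions that give approximately equilateral triangles is sketched below in Python. The construction is an assumption made for illustration (the paper does not state its rule): the height of an equilateral triangle whose base is the arc length $r\,\Delta\theta$ is $(\sqrt{3}/2)\, r\, \Delta\theta$, so the ring radii are taken to grow geometrically, with the ratio adjusted so that the last ring falls exactly on the outer radius. The function name is hypothetical.

```python
import math

def radial_rings(r_a, r_b, dtheta_deg):
    """Ring radii from r_a to r_b for nodes spaced dtheta_deg apart on each circle."""
    dtheta = math.radians(dtheta_deg)
    # Ideal growth factor: next radius = r + (sqrt(3)/2) * r * dtheta.
    growth = 1.0 + 0.5 * math.sqrt(3.0) * dtheta
    n_rings = max(1, round(math.log(r_b / r_a) / math.log(growth)))
    # Adjust the ratio so that the final ring lands exactly on r_b.
    ratio = (r_b / r_a) ** (1.0 / n_rings)
    return [r_a * ratio ** k for k in range(n_rings + 1)]

rings_4deg = radial_rings(2.2, 3.0, 4.0)
rings_2deg = radial_rings(2.2, 3.0, 2.0)
steps_4deg = [b - a for a, b in zip(rings_4deg, rings_4deg[1:])]
steps_2deg = [b - a for a, b in zip(rings_2deg, rings_2deg[1:])]
```

With $r_a$ = 2.2 [m] and $r_b$ = 3.0 [m] this rule gives radial steps of roughly 0.14 [m] to 0.18 [m] at four degree spacing and 0.07 [m] to 0.09 [m] at two degree spacing, and the step grows with radius as the text describes.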

Numerical calculations were performed for a cylindrical blast wave [9,
pp. 219-20] generated by the instantaneous release of a finite amount of
energy proportional to $E_0 = 6.887025656 \times 10^6$ [J/m] into a gas with initial
density $\rho_0$ = 1.262523446 [kg/m$^3$]. For the calculations, the specific heat
ratio of the perfect gas is $\gamma$ = 1.4, the radii are $r_a$ = 2.2 [m], $r_b$ = 3.0 [m],
and the initial time is $t_0 = 3.85342034 \times 10^{-3}$ [s]. On the initial time plane
the values of flow variables are calculated from the exact solution. For the
inflow conditions at $r_a$ and outflow condition at $r_b$ we again assign the exact
values to the flow variables. The computed ratios of $p/p_s$, $v_r/v_{rs}$, $e/e_s$ and
$\rho/\rho_s$ (subscript s denotes value at the shock front) are compared to Sedov's
exact values (solid line) in Figures 3, 4, 5 and 6, respectively. The
symbols ⋈ and × denote the computed value on the triangular finite element
mesh with nodes spaced at two degree intervals (the computed radial
subdivisions range from 0.07 [m] to 0.09 [m]) and at four degree intervals (the
computed radial subdivisions range from 0.14 [m] to 0.18 [m]), respectively.
Both sets of values are at the same time ($4.12516608 \times 10^{-3}$ [s]). The former
used four timesteps ($\Delta t \approx 0.68 \times 10^{-4}$ [s]) and the latter two timesteps
($\Delta t \approx 1.36 \times 10^{-4}$ [s]) to reach the termination time. The maximum absolute
value of the percent relative error for the ratios of pressure, radial velocity,
internal energy and density are 1.15%, 0.12%, 0.10% and 1.24%, respectively, for
the finer mesh and 2.68%, 0.30%, 0.36% and 3.04%, respectively, for the coarser
mesh. We recall that the density is computed from the equation of state once
the pressure and internal energy are known; consequently,
the error in the density includes both the pressure and energy errors. We note
that the maximum relative error in each flow variable occurs where the finite
elements are the largest; that is, near $r_b$. At the opposite end, near $r_a$, the
finite elements are the smallest and the absolute value of the percent
relative error of the pressure, radial velocity, internal energy and density
ratios for the finer and coarser mesh are 0.11%, 0.01%, 0.02%, 0.14% and 0.44%,
0.18%, 0.19%, 0.64%, respectively. Hence, overall and despite the relatively
large size of the triangular elements, the agreement is fairly good.

Figure 3. Comparison of Exact Normalized Pressure ($p/p_s$) for Cylindrical Blast
Wave [9] with Computed Values for Nodes Spaced at 4° (×) and 2° (⋈)
Intervals. Abscissa: normalized position ($r/r_s$);
$r_s$ = 3.10397969 [m], $p_s = 1.48919895 \times 10^5$ [Pa].

Figure 4. Comparison of Exact Normalized Radial Velocity ($v_r/v_{rs}$) for Cylindrical
Blast Wave [9] with Computed Values for Nodes Spaced at 4° (×)
and 2° (⋈) Intervals. Abscissa: normalized position ($r/r_s$);
$r_s$ = 3.10397969 [m], $v_{rs} = 3.13520549 \times 10^2$ [m/s].

Figure 5. Comparison of Exact Normalized Internal Energy ($e/e_s$) for Cylindrical
Blast Wave [9] with Computed Values for Nodes Spaced at 4° (×) and 2° (⋈)
Intervals. Abscissa: normalized position ($r/r_s$);
$r_s$ = 3.10397969 [m], $e_s = 4.91475674 \times 10^4$ [J/kg].
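The reference values quoted for this test case can be cross-checked with a short Python sketch. It assumes two standard facts that are not spelled out in the text: the shock radius of the cylindrical similarity solution grows like $t^{1/2}$, and the strong-shock limit of the jump conditions holds at the front. The constant and function names are hypothetical.

```python
import math

GAMMA = 1.4
RHO_0 = 1.262523446          # initial density [kg/m^3]
R_B = 3.0                    # shock radius at the initial time [m]
T_0 = 3.85342034e-3          # initial time [s]
T_END = 4.12516608e-3        # termination time [s]

def shock_radius(t):
    """Shock position, assuming r_s is proportional to sqrt(t)."""
    return R_B * math.sqrt(t / T_0)

def strong_shock_state(t):
    """Values just behind the front in the strong-shock limit."""
    r_s = shock_radius(t)
    speed = r_s / (2.0 * t)                      # d r_s / dt for r_s ~ sqrt(t)
    rho_s = (GAMMA + 1.0) / (GAMMA - 1.0) * RHO_0
    v_s = 2.0 / (GAMMA + 1.0) * speed
    p_s = 2.0 / (GAMMA + 1.0) * RHO_0 * speed ** 2
    e_s = p_s / ((GAMMA - 1.0) * rho_s)          # perfect gas
    return r_s, p_s, v_s, e_s

r_s, p_s, v_rs, e_s = strong_shock_state(T_END)
dt_fine = (T_END - T_0) / 4.0      # four timesteps on the finer mesh
dt_coarse = (T_END - T_0) / 2.0    # two timesteps on the coarser mesh
```

Under these assumptions the sketch reproduces the shock-front values listed with Figures 3 to 5 to about six significant digits, and the two timestep sizes come out near $0.68 \times 10^{-4}$ [s] and $1.36 \times 10^{-4}$ [s].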

A normal shock of strength five ($p_2/p_1 = 5$) propagating into a rectangular
field is the second calculation. The subscripts 1 and 2 on the variables
denote their value in the pre- and post-shock states, respectively. The flow
domain consists ideally of the infinite slab of the (x, y) plane,
$-\infty < x < \infty$, $y_2 \le y \le y_1$ for time greater than the initial time $t_0$. The
shock is initially at the position $y_{s_0}$, $y_2 < y_{s_0} < y_1$, and is moving in the
positive y-direction. The quiescent state is characterized by zero velocities
($u_1 = v_1 = 0$ [m/s]) and by pressure ($p_1 = 1.01325 \times 10^5$ [Pa]) and density
($\rho_1$ = 1.225570786 [kg/m$^3$]) of air (specific heat ratio $\gamma$ = 1.4) at sea
level and at temperature 288.15 [K]. The values of the flow variables in
the disturbed region are computed by the Rankine-Hugoniot relations [10,
pp. 62-65]. The nodes are equally spaced along constant y values and the y
directed divisions are computed so that approximate equilateral triangles
result. The computational domain for a given time is very similar to that
in Figure 2 except that the radial and angular variables now correspond to
the y and x variables, respectively. However, all the triangles in this case
are the same size. On the initial time plane $t_0$, the values of the flow
variables are calculated from the exact solution. The boundaries at $y_1$ and $y_2$
are taken sufficiently far away from the shock front so that the constant
values in both the disturbed and quiescent regions are attained by the flow
variables.
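The post-shock state for this test can be sketched with the standard perfect-gas Rankine-Hugoniot relations. The Python fragment below is an illustration, not the authors' code; the function name and dictionary keys are hypothetical, and the quiescent pressure is assumed to be the standard sea-level value of $1.01325 \times 10^5$ [Pa].

```python
import math

def normal_shock(p1, rho1, pressure_ratio, gamma=1.4):
    """Post-shock state for a normal shock moving into gas at rest."""
    g = gamma
    rho_ratio = ((g + 1.0) * pressure_ratio + (g - 1.0)) / \
                ((g - 1.0) * pressure_ratio + (g + 1.0))
    mach_sq = 1.0 + (g + 1.0) / (2.0 * g) * (pressure_ratio - 1.0)
    c1 = math.sqrt(g * p1 / rho1)              # sound speed ahead of the shock
    shock_speed = math.sqrt(mach_sq) * c1
    v2 = shock_speed * (1.0 - 1.0 / rho_ratio)  # gas velocity behind the shock
    p2 = pressure_ratio * p1
    rho2 = rho_ratio * rho1
    e2 = p2 / ((g - 1.0) * rho2)               # internal energy per unit mass
    return {"rho_ratio": rho_ratio, "mach_sq": mach_sq,
            "shock_speed": shock_speed, "v2": v2,
            "p2": p2, "rho2": rho2, "e2": e2}

P1 = 1.01325e5          # assumed quiescent pressure [Pa]
RHO1 = 1.225570786      # quiescent density [kg/m^3]
state = normal_shock(P1, RHO1, 5.0)
```

For a pressure ratio of five and $\gamma$ = 1.4 the density ratio is 12.4/4.4, about 2.82, and the returned state satisfies the mass and momentum jump conditions by construction.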

For the calculations, the y directed subdivisions are small in order to
model the extremely thin shock. The y directed divisions
are 0.005 [m], $y_2$ = 2.965 [m], $y_1$ = 3.050 [m], $y_{s_0}$ = 3.0025 [m] and
$t_0$ = 0.0 [s]. The graphs of the computed ratios for the pressure ($p/p_s$), the
y-velocity ($v/v_s$), the internal energy ($e/e_s$) and the density ($\rho/\rho_s$) versus
the y-axis are given in Figures 7, 8, 9, 10, respectively, for three times:
$1.1524902 \times 10^{-4}$ [s] (fourth time level), $2.5986071 \times 10^{-4}$ [s] (ninth time
level) and $4.0654310 \times 10^{-4}$ [s] (fourteenth time level). The symbol ⋈ denotes
the theoretical position of the shock front. Figures 7-10 show fair
resolution of the locations of this shock front for the given times. Spurious
oscillations develop at the shock front as expected, since no artifice was
introduced in the equations or scheme to suppress them. The oscillations occur
about the exact solutions $p/p_s = v/v_s = e/e_s = \rho/\rho_s = 1$ in the disturbed flow
region. A major objective of future work is to curtail these oscillations
(see section 5).

Figure 7. Variation of Normalized Pressure with Distance for Normal Shock
of Strength Five at Three Times ($t = 1.1524902 \times 10^{-4}$ [s],
$t = 2.5986071 \times 10^{-4}$ [s], $t = 4.0654310 \times 10^{-4}$ [s]).

Figure 9. Variation of Normalized Internal Energy with Distance for Normal
Shock of Strength Five at Three Times ($t = 1.1524902 \times 10^{-4}$ [s],
$t = 2.5986071 \times 10^{-4}$ [s], $t = 4.0654310 \times 10^{-4}$ [s]).

We close this section with a discussion of the calculation times. The
times needed to calculate the flow variables both for an entire time level and
at an individual interior node are given to help ascertain the time
characteristics of the scheme. All the calculations were done on the BRLESC computer
facility at the Ballistic Research Laboratory. The calculations of the flow
behind a cylindrical blast wave took approximately 1.3 and 0.5 minutes of run
time per time level for the finer and coarser grids, respectively, and
approximately eight seconds per interior node for both grids. These times include,
on the average, two iterations per node in the Newton-Raphson method. The
shock propagation calculations took approximately 4.5 minutes per time level
and approximately 16 seconds per interior node. The shock propagation problem
typically required four iterations per node in order to find convergent values
of the unknown parameters $a_i$. Although the above times are not as small as one
might hope, they can be reduced substantially by simple modifications of the
algorithm (see section 5). Such simple optimizations must be investigated in
future work.


5. SUMMARY. We have presented a numerical scheme for solving time dependent
hyperbolic equations in two space dimensions. This method merges
the  concepts  of  the  finite  element  method  and  the  properties  of  hyperbolic 
systems  of  equations  and  is  applied  to  unsteady  gas  flows.  The  essential 
features  of  the  corresponding  hydrocode  are  briefly  discussed.  Finally, 
numerical  calculations  involving  both  smooth  and  shocked  flows  are  given  and 
discussed. 

The  purpose  of  this  report  is  to  give  a summary  of  the  work  already 
accomplished rather than a definitive description of the quality and
usefulness of this numerical scheme. However, several positive aspects of the
method can already be seen. The formulation of the method is
straightforward and avoids the large matrices associated with the finite element
method.  The  method  handles  different  geometrically  shaped  boundaries  easily 
and  accurately.  Even  for  relatively  large  mesh  sizes,  the  results  for  a 
smooth  flow  are  accurate.  Finally,  in  a propagating  shock  problem,  the 
shock's  position  can  easily  be  discerned. 

From the example calculations in section 4, it is clear that
improvements must be made in several areas before the method's potential can be
accurately  ascertained.  Listed  below,  not  necessarily  in  order  of  importance 
or  difficulty,  are  several  such  areas. 

a. Curtail the spurious oscillations near shocks. Several methods,
such as the flux-corrected transport techniques of Boris, et al. [11] and the
well-known artificial viscosity method (see Roache [12]), can be applied.
Although the latter is simpler to use, the former technique may hold more
promise because the oscillations do occur about the correct solution and the
flux correction will not significantly reduce the resolution of the shock as
does artificial viscosity.

b. Shorter computing time. The run time can be significantly reduced
by simple alterations in the algorithm. Recall that the spatial derivatives
of the variables at the known time level ($u_0$, $v_0$, $p_0$, $e_0$) were computed on
every triangular base of the finite element, which resulted in the repeated
calculation (up to NUMTRI times) of the residuals and their partial
derivatives at the nodes (see equations (13) - (14)). If these derivatives
at the nodes were computed once for a given finite element, the time
reduction would be by a factor of 0.5. The run time could be further reduced
by, at least, an order of magnitude if the calculations were done on a
machine similar to the CDC 7600.

c.  Rerun  examples  with  different  equation  formulations.  Certain 
formulations  may  increase  the  accuracy  of  the  computed  results  and  decrease 
the  oscillations  due  to  shocks. 

d.  Apply  the  method  to  an  actual  numerical  problem.  By  applying  the 
method to a problem with complicated boundaries, not only could the method's
treatment of boundaries be tested, but also the entire method could be
compared to another numerical solution technique.

e.  Extend  the  method  with  respect  to  "infinite"  strength  shocks  and 
moving  boundaries.  Once  these  extensions  are  incorporated  into  the  method, 
this  scheme  could  hopefully  be  used  to  finally  develop  an  adequate  model  of 
the severe transitional ballistics environment.


APPENDIX A

BOUNDARY NODE FORMULATION FOR A ZERO NORMAL VELOCITY BOUNDARY CONDITION

Let Q be a node at a stationary wall where the tangent is defined. The
normal velocity being zero at Q implies that

\[ u(x,y,t) \sin\alpha - v(x,y,t) \cos\alpha = 0, \tag{A1} \]

where (x, y) are the spatial coordinates of node Q and $\alpha$ is the angle between
the tangent line at Q and the x-axis. Since the wall is stationary, $\alpha$ is a
constant and equation (A1) holds for all times. By translating the axis to
(x, y, t) (the plane corresponding to t is the initial plane) and using the
interpolating functions for u and v, equations (9) and (10), respectively,
equation (A1) becomes

\[ [u_0(0,0,0) + a_3 t] \sin\alpha - [v_0(0,0,0) + a_6 t] \cos\alpha = 0, \quad 0 \le t \le \Delta t. \tag{A2} \]

Since the solid wall is stationary, equation (A2) reduces to

\[ a_3 \sin\alpha - a_6 \cos\alpha = 0. \tag{A3} \]

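To minimize the least square residual error over the finite element at the
point $(x, y, t + \Delta t)$ subject to equation (A3), we use Lagrange multipliers to
obtain

\[ \cos\alpha \left[ \sum_{l=1}^{\mathrm{NUMTRI}} \iiint_{\text{prism } l} \sum_{k=1}^{4} D_k \frac{\partial D_k}{\partial a_3}\; dx\,dy\,dt \right] + \sin\alpha \left[ \sum_{l=1}^{\mathrm{NUMTRI}} \iiint_{\text{prism } l} \sum_{k=1}^{4} D_k \frac{\partial D_k}{\partial a_6}\; dx\,dy\,dt \right] = 0, \tag{A4} \]

\[ a_3 \sin\alpha - a_6 \cos\alpha = 0. \tag{A5} \]

Thus, in the minimization of the residual error over the boundary type element
where zero normal velocity is imposed, equations (A4) and (A5) replace
$\partial E(a)/\partial a_3 = 0$ and $\partial E(a)/\partial a_6 = 0$ of equation system (13),
respectively.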
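The effect of the wall constraint can be illustrated on a small quadratic model problem. In the Python sketch below the error is assumed to have the form $E = \frac{1}{2} a^T H a - b^T a$ in the two constrained parameters only; the matrix H, the vector b, and the function name are hypothetical. Eliminating the Lagrange multiplier leaves the two conditions that the gradient of E has no component along the wall tangent $(\cos\alpha, \sin\alpha)$ and that the constraint $a_3 \sin\alpha - a_6 \cos\alpha = 0$ holds.

```python
import math

def wall_constrained_minimum(h, b, alpha):
    """Minimize 1/2 a^T H a - b^T a over a = (a3, a6)
    subject to a3*sin(alpha) - a6*cos(alpha) = 0."""
    c, s = math.cos(alpha), math.sin(alpha)
    # The constraint forces (a3, a6) to be a multiple of the tangent d = (c, s).
    d_h_d = c * (h[0][0] * c + h[0][1] * s) + s * (h[1][0] * c + h[1][1] * s)
    d_b = c * b[0] + s * b[1]
    scale = d_b / d_h_d          # from d . (H * scale * d - b) = 0
    return scale * c, scale * s

# Hypothetical symmetric positive definite model data.
H = [[4.0, 1.0], [1.0, 3.0]]
B = [1.0, 2.0]
ALPHA = 0.6
a3, a6 = wall_constrained_minimum(H, B, ALPHA)

# Gradient of the model error at the constrained minimum.
grad3 = H[0][0] * a3 + H[0][1] * a6 - B[0]
grad6 = H[1][0] * a3 + H[1][1] * a6 - B[1]
tangential_gradient = math.cos(ALPHA) * grad3 + math.sin(ALPHA) * grad6
constraint_residual = a3 * math.sin(ALPHA) - a6 * math.cos(ALPHA)
```

Both `tangential_gradient` and `constraint_residual` vanish at the result, which is the two-equation pattern that replaces the two unconstrained stationarity conditions.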


LIST  OF  SYMBOLS 

vector  of  unknown  parameters 
i-th component of vector a

internal  energy  per  unit  mass  [J/kg] 
pressure  [Pa ] 

polar  radial  coordinate  [m] 

radii  of  annular  region  for  blast  wave  calculation  [m] 
time  [s] 

given  value  of  time  [s] 
time  increment  [s] 

Gaussian  quadrature  points  used  on  time  integration  [s] 

velocity  component  in  x direction  [m/s] 
velocity  component  in  y direction  [m/s] 
velocity  component  in  the  radial  direction  [m/s] 

Cartesian  spatial  coordinate  [m] 
given  value  of  the  x coordinate  [m] 

Cartesian  spatial  coordinate  [m] 
given value of the y coordinate [m]

D_k(x, y, t; a)   nondimensional residual error from the k-th differential
                  equation at a point (x, y, t)
E(a)              total residual least squares error over a finite element
F_i(a)            the first partial derivative of E(a) with respect to a_i
NQEQ              number of differential equations to be solved
                  simultaneously
NUMTRI            number of prisms composing a finite element
V                 volume of a finite element [m^3]
ρ(x, y, t)        density [kg/m^3]
u(x, y, t)        generic flow variable

Subscripts

0                 corresponds to known value at a given time
s                 corresponds to value at shock front
1                 corresponds to value in the pre-shock state
2                 corresponds to value in the post-shock state


THE NUMERICAL SOLUTION OF AXISYMMETRIC FREE
BOUNDARY POROUS FLOW WELL PROBLEMS


Colin W. Cryer and Hans Fetter
Mathematics Research Center, University of Wisconsin, Madison, Wisconsin


1.  Introduction 


The steady state problem to be considered is shown in Figure 1.1. An axisymmetric
well of radius r is sunk into a layer of soil of depth H and radius R. The bottom
of the soil layer is impervious. The outer boundary of the soil adjoins a catchment area
and the hydraulic head u is equal to the constant H along this boundary. The water
seeps towards the well and a pump (not shown) maintains the water level in the well at
a constant height $h_w$. The water-air interface is a free boundary which intersects the
well wall at a height $h_s$.

The mathematical problem can now be formulated as follows (see Hantush [1964],
Bear [1972], and Cryer [1976, p. 36]):

Problem A

Find functions $\varphi(x)$ (the height of the free boundary) and $u(x,y)$ (the hydraulic
head) such that (from the equation of continuity and Darcy's law):

\[ \operatorname{div}(k \operatorname{grad} u) \equiv \frac{\partial}{\partial x}\left(k \frac{\partial u}{\partial x}\right) + \frac{\partial}{\partial y}\left(k \frac{\partial u}{\partial y}\right) = 0, \quad \text{in } \Omega, \tag{1.1} \]

together with the boundary conditions,

\[ u = H, \quad \text{on AB}, \tag{1.2} \]
\[ \frac{\partial u}{\partial n} = 0, \quad \text{on BC } (\Gamma_4), \tag{1.3} \]
\[ u = h_w, \quad \text{on CD } (\Gamma_2), \tag{1.4} \]
\[ u = y, \quad \text{on DE}, \tag{1.5} \]
\[ u = y, \quad \text{on EA } (\Gamma_0), \tag{1.6} \]
\[ \frac{\partial u}{\partial n} = 0, \quad \text{on EA } (\Gamma_0). \tag{1.7} \]

Here, $\Omega$ is the (unknown) domain,

\[ \Omega = \{ (x,y) : 0 < y < \varphi(x),\; r < x < R \}, \]

and $\partial/\partial n$ denotes the outward normal derivative. Finally, $k = k(x,y) = x\,\kappa(x,y)$, where the
permeability of the soil is denoted by $\kappa = \kappa(x,y)$. It is assumed that k is of
the form

\[ k(x,y) = \exp[ f(x) + g(y) ] \]

where f(x) and g(y) are continuously differentiable and

\[ g'(y) \ge 0. \]

In particular if the permeability $\kappa$ is constant, 1 say, then

\[ k(x,y) = x = \exp[\ln x] \]

so that

\[ f(x) = \ln x, \quad g(y) = 0. \]

Ever since C. Baiocchi introduced [...] approach to the solution of various free
boundary problems [...], numerous studies have appeared in the literature [...].
Cryer [...] and [...] Brezzi, and Comincioli [...] give bibliographies.

The basic idea introduced by Baiocchi can be summarized as follows: Through a
suitable change of the unknown variable, the free boundary problem is reduced to that of
minimizing a quadratic functional on a closed convex set. This reformulation
not only enables one to determine various properties of the solution, but it also has
the advantage that the new problem can readily be solved numerically by several methods
including finite differences and finite elements.

Here, we apply an extension of Baiocchi's work due to Benci [1973, 1974]. A more
complete account of our work will be found in Cryer and Fetter [1977].

2. Formulation as a variational inequality

In this section we follow Benci [1973, 1974] and reformulate Problem A as a
variational inequality.


The first step is to introduce a "Baiocchi function" w(x,y) defined on the
rectangle

\[ D = \{ (x,y) : r < x < R,\; 0 < y < H \} \tag{2.1} \]

as follows:

\[ w(x,y) = \begin{cases} \displaystyle\int_y^{\varphi(x)} \exp(g(t))\,(u(x,t) - t)\, dt, & \text{for } (x,y) \in \Omega, \\[1ex] 0, & \text{for } (x,y) \in D - \Omega. \end{cases} \tag{2.2} \]

The values of w on the boundary of D can be given explicitly: for a special case
see (3.8).


We use the following function spaces (see Adams [1975]): C(D), the space of
functions which are continuous on D, equipped with the supremum norm; $H^1(D)$, the
Sobolev space of weakly differentiable functions on D, which is sometimes denoted by
$H^{1,2}(D)$ or $W^{1,2}(D)$, and which is equipped with the norm $\| \cdot \|_1$; and $H^1_0(D)$, the
subspace of $H^1(D)$ consisting of those elements of $H^1(D)$ which "vanish" on the
boundary of D.

Let a be the bilinear operator defined on $H^1(D) \times H^1(D)$ by,

\[ a(u,v) = \int_D \exp[f(x) - g(y)]\, \operatorname{grad} u \cdot \operatorname{grad} v \; dx\,dy = \int_D \exp[f(x) - g(y)]\, (u_x v_x + u_y v_y)\; dx\,dy. \tag{2.3} \]

Let j be the linear functional defined on $H^1(D)$ by,

\[ j(v) = \int_D \exp[f(x)]\, v \; dx\,dy. \tag{2.4} \]

Let K be the closed convex set

\[ K = \{ v \in H^1(D) : v - w \in H^1_0(D) \text{ and } v \ge 0 \text{ a.e. in } D \}. \tag{2.5} \]

Then Benci [1973, 1974] proves

Theorem 2.1

If u is a solution of Problem A and $g'(y) \ge 0$ then $w \in H^1(D) \cap C(D)$ and
w satisfies the variational inequality: Find $w \in K$ such that

\[ a(w, v-w) + j(v-w) \ge 0, \tag{2.6} \]

for all $v \in K$. □

It follows from the basic theory of variational inequalities (Stampacchia [1964])
that

Theorem 2.2

There exists a unique solution $w \in H^1(D)$ of the variational inequality
formulation (2.6) of the axisymmetric well problem. □

3. Numerical approximation

It has been shown in the previous section that the Baiocchi function w satisfies
the variational inequality (2.6): Find $w \in K$ such that for all $v \in K$,

\[ a(w, v-w) + j(v-w) \ge 0. \tag{3.1} \]

Since a is symmetric, that is

\[ a(v,w) = a(w,v), \quad \text{for all } v, w \in H^1(D), \]

there is a connection between the variational inequality (3.1) and the unilateral
minimization problem

\[ \min_{v \in K} J(v), \qquad J(v) = a(v,v) + 2 j(v). \tag{3.2} \]

This connection is given by the following theorem (Lions [1971, p. 9]):

Theorem 3.1

Let a(v,w) be a symmetric coercive bilinear form. Then w is a solution of the
variational inequality (3.1) iff w is a solution of the unilateral minimization problem (3.2).
□


We approximate w by choosing a finite-dimensional approximation $K_h$ of K and solving
the finite-dimensional problem: Find $w_h \in K_h$ such that

\[ J(w_h) = \min_{v_h \in K_h} J(v_h). \tag{3.3} \]

The convex set $K_h$ is constructed as follows. The domain D is triangulated
as shown in Figure 3.1.

Figure 3.1: The triangulation of D

The subdivisions are not necessarily uniform, but it is assumed that there is a constant
$\beta$ such that

\[ \frac{1}{\beta}\,(\text{maximum interval length}) \le h \le \beta\,(\text{minimum interval length}), \tag{3.4} \]

where h is a measure of the fineness of the subdivision. The set of interior gridpoints
will be denoted by $D_h$ and the set of boundary gridpoints will be denoted by $\partial D_h$.

We denote by $V_h$ the space of piecewise linear functions (linear finite elements)
corresponding to the triangulation in Figure 3.1. We set

\[ K_h = \{ v_h \in V_h : v_h \ge 0 \text{ in } D_h \text{ and } v_h = w \text{ on } \partial D_h \}. \tag{3.5} \]

The approximation $w_h$ is readily computed. We can derive an error estimate for
$w_h$ by combining the ideas of Brezzi and Sacchi (to appear) and Brezzi, Hager, and
Raviart (to appear):

The integer m was always chosen to be a multiple of 4 so that the corner D was a
gridpoint; this was desirable since w is not smooth at the corner D.

The permeability was taken to be one so that $f(x) = \ln x$ and $g(y) = 0$.
From (2.3), (2.4), and (3.2),

\[ J(v) = \int_D x\,[\, v_x^2 + v_y^2 + 2v \,]\; dx\,dy. \tag{3.7} \]

The boundary values of w are

\[ \begin{aligned} w(x,H) &= 0, && \text{on AF}, \\ w(R,y) &= (H-y)^2/2, && \text{on AB}, \\ w(r,y) &= (h_w - y)^2/2, && \text{on CD}, \\ w(r,y) &= 0, && \text{on DF}, \\ w(x,0) &= \frac{h_w^2 \ln(R/x) + H^2 \ln(x/r)}{2 \ln(R/r)}, && \text{on BC}. \end{aligned} \tag{3.8} \]

The approximation was computed using S.O.R. with projection (Cryer [1971],
Glowinski [1971]).
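The idea of S.O.R. with projection can be sketched on a one-dimensional model problem. The Python fragment below is an illustration only, not the axisymmetric computation of this paper: it minimizes $\frac{1}{2} u^T A u - b^T u$ subject to $u \ge 0$ for the hypothetical matrix A = tridiag(-1, 2, -1), and the function name and data are assumptions. Each sweep performs the usual over-relaxed Gauss-Seidel update and then projects the new value back onto the constraint set.

```python
def projected_sor(b, omega=1.5, tol=1e-10, max_iter=10_000):
    """Minimize 1/2 u^T A u - b^T u subject to u >= 0, A = tridiag(-1, 2, -1)."""
    n = len(b)
    u = [0.0] * n
    for _ in range(max_iter):
        change = 0.0
        for i in range(n):
            left = u[i - 1] if i > 0 else 0.0
            right = u[i + 1] if i < n - 1 else 0.0
            gauss_seidel = (b[i] + left + right) / 2.0
            relaxed = u[i] + omega * (gauss_seidel - u[i])
            new_value = max(0.0, relaxed)   # projection onto u_i >= 0
            change = max(change, abs(new_value - u[i]))
            u[i] = new_value
        if change < tol:
            break
    return u

def residual(u, b):
    """Componentwise A u - b for A = tridiag(-1, 2, -1)."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * u[i] - left - right - b[i])
    return out

# Constraint inactive: the result agrees with the linear solve A u = b.
u_free = projected_sor([1.0, 1.0, 1.0])

# Constraint active on part of the grid (hypothetical load).
b_mixed = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
u_mixed = projected_sor(b_mixed)
r_mixed = residual(u_mixed, b_mixed)
```

At convergence the iterate satisfies the discrete complementarity conditions: $u_i \ge 0$, $(Au - b)_i \ge 0$, and their product vanishes at every gridpoint. The zero entries of the solution mark the region outside the approximate free boundary.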

The computations presented no difficulties. The solution of the smaller problem
is given in Table 3.1.

In Table 3.1 the position of the approximate free boundary is shown by the first
zero term in each column. The approximate solution is identically zero on the vertical
line x = r so that it is not possible to determine the height $h_s$ at which the free
boundary intersects the well. As an approximation to $h_s$ we take the height $h_s^*$ of
the free boundary at the vertical gridline adjacent to the well. For example, from
Table 3.1 we obtain $h_s^* = 36$.


For comparison, we compare in Table 3.2 the values of $h_s$ obtained by different
authors. With the exception of the present computation, all the results are presented
graphically so that we have had to estimate $h_s$ from graphs.

Table 3.2

Author                                   Method                                       h_s
Hall [1955, p. 29]                       trial free-boundary; finite differences      34.0
Taylor and Luthin [1969]                 time-dependent; finite differences           34.0
Neuman and Witherspoon [1970]            trial free-boundary; finite elements         30.0
Neuman and Witherspoon [1971, p. 620]    time-dependent; finite differences           30.0
Present (m=64, n=36)                     variational inequalities                     30.0

The differences in Table 3.2 may be explained by the fact that the physical assumptions
differed: Hall [1955] assumed capillarity and a lined well; Taylor and Luthin [1969]
assumed partially saturated flow; and Neuman and Witherspoon [1970, 1971] made the
same assumptions as in the present paper.


DIRECT AND ITERATIVE ONE DIMENSIONAL FRONT TRACKING METHODS
FOR THE TWO DIMENSIONAL STEFAN PROBLEM

Gunter  H.  Meyer 
School  of  Mathematics 
Georgia  Institute  of  Technology 
Atlanta,  Georgia  30332 


ABSTRACT.  An  application  of  the  method  of  fractional  steps  and  of  the 
method  of  lines  to  general  two  dimensional  Stefan  type  problems  is  described. 

The  numerical  aspects  are  discussed  and  the  performance  of  both  methods  for  a 
model  problem  is  compared. 

1.  INTRODUCTION.  Two  and  three  dimensional  free  surface  problems  for 
partial differential equations lead to many of the same computational
difficulties observed with nonlinear boundary value problems on fixed domains. The
data management demands and long computing times required for the iterative
solution of nonlinear problems place a premium on the efficiency of the
numerical method itself and on its implementation on a given computer installation.
If  a general  class  of  problems  can  be  identified  which  must  be  solved  repeatedly, 
then  a careful  design  of  a computer  algorithm  and  its  code  (and  the  concomitant 
expense)  is  justified.  If,  however,  nonstandard  problems  are  to  be  solved  then 
an  ad-hoc  method  with  modest  programming  demands  may  be  attractive  even  if 
computing  times  are  not  minimized.  In  our  view  free  surface  problems  may  be 
considered  nonstandard  because  of  the  variety  of  differential  operators  and 
boundary  conditions  which  arise  in  practice;  and  locally  one  dimensional  methods 
are  advocated  as  solution  techniques  precisely  because  they  place  modest  demand 
on  mathematical  training  and  programming  skill.  They  may  not  be  competitive 
in  execution  time  with  more  sophisticated  techniques  (such  as  multi  dimensional 
finite  element  methods),  but  such  differences  blur  when  weighted  against  the 
accounting  practice  for  man  and  machine  of  any  given  computer  center. 

Two  distinct  locally  one  dimensional  methods  have  been  examined  so  far  [3], 
[4],  [5].  A naive  version  of  a fractional  step  method  (a  special  alternating 
direction  method)  is  particularly  simple  to  apply.  It  is  effective  in  certain 
geometries  and  reasonably  efficient  computationally.  An  alternate  approach  is 
via  the  method  of  lines  coupled  with  an  iterative  solution  procedure.  It  is 
advocated  for  some  geometries  where  the  fractional  step  method  cannot  immediately 
be  applied.  While  this  method  is  also  straightforward  to  explain  it  appears 
computationally  less  efficient  for  problems  requiring  a fine  spatial  resolution. 
In  this  paper  both  methods  are  presented  for  a general  class  of  free  boundary 
problems  in  the  plane.  More  general  differential  operators  and  different 
boundary  conditions  are  considered  here  than  in  the  earlier  papers.  Moreover, 
a first  comparison  is  made  between  both  methods  in  the  solution  of  a realistic 
problem  when  each  one  dimensional  problem  is  solved  with  the  front  tracking 
invariant  imbedding  method. 
2. THE METHOD OF FRACTIONAL STEPS FOR FREE SURFACE PROBLEMS. Primarily
in  the  Russian  literature  the  method  of  fractional  steps  has  found  favor  as  a 
straightforward  solution  technique  for  multi  dimensional  boundary  value  problems. 
An  exposition  of  this  method  and  selected  applications  may  be  found  in  the 
monograph  of  Yanenko  [8].  Moreover,  it  is  known  that  the  method  is  valuable 
for  certain  nonlinear  problems  for  parabolic  equations,  among  them  the  classical 
Stefan  problem  in  its  enthalpy  formulation  which  eliminates  the  need  to  track 
a priori  unknown  free  surfaces  [1],  [7].  However,  in  many  applications  the 
differential  equations  and  the  boundary  conditions  make  it  very  doubtful  that 
fixed  domain  transformations  similar  to  the  enthalpy  transformation  exist  at 
all  so  that  it  may  become  necessary  to  actually  track  the  free  surface  as  a 
function  of  time.  It  now  seems  natural  to  combine  one  dimensional  front 
tracking  routines  with  the  method  of  fractional  steps  to  resolve  such  problems 
via  a sequence  of  (usually  well  understood)  one  dimensional  problems.  It  has 
been  demonstrated  in  [3]  that  an  effective  and  transparent  solution  method  for 
quite  general  free  boundary  problems  results.  We  shall  give  here  an  exposition 
which  somewhat  parallels  that  of  [3];  however,  more  general  equations  and  flux 
boundary  conditions  rather  than  Dirichlet  data  are  chosen  to  illustrate  the 
versatility  of  the  method. 

The  problem  to  be  considered  will  be  written  as 

(2.1)  $Lu \equiv \nabla\cdot k\nabla u + a\cdot\nabla u + bu - cu_t = f$,  $(x,y) \in D(t)$,  $t > 0$

(2.2)  $u(x,y,0) = u_0(x,y)$,  $(x,y) \in D(0)$

where the data functions $k$, $a = (a_1,a_2)$, $b$, $c$, and $f$ are assumed to be smooth
functions of x, y, and t. The initial domain $D(0)$ is given; it evolves into
the general domain $D(t)$ whose boundary in part is unknown. Specifically, we
shall assume that the boundary $\partial D(t)$ of $D(t)$ consists of two parts $\partial D_1(t)$ and
$\partial D_2(t)$ where $\partial D_1(t)$ is a given curve including its end points, and where $\partial D_2(t)$
is the free boundary. Quite general domains and boundary shapes for $\partial D_1(t)$
can be treated. However, to simplify the exposition and to aid in visualizing
the geometry we shall assume that $D(t)$ consists of a rectangle $[0,X] \times [0,Y]$
whose upper right hand corner has been cut off by the free surface $\partial D_2(t)$.

The  numerical  example  of  section  4 uses  this  geometry  (see  Fig.  1).  The 
following  assumption  is  essential  for  the  applicability  of  the  fractional  step 
method : 

Hypothesis:  The  free  surface  has  no  horizontal  or  vertical  tangents. 

This hypothesis implies that $\partial D_2(t)$ at any time t can be expressed in parametric
form as

$\partial D_2(t) = \{(x, s_2(x,t))\}$  or  $\partial D_2(t) = \{(s_1(y,t), y)\}$

where $x = s_1(y,t)$ and $y = s_2(x,t)$ are inverse functions of each other. For
Dirichlet data the end points of $\partial D_1(t)$ are given, while for more general flux
conditions they must be calculated. The initial shape $\partial D_2(0)$ is assumed given.
On $\partial D_1(t)$ a reflection condition of the form

(2.3)  $-k\,\partial u/\partial n = \eta u + \alpha$,  $(x,y) \in \partial D_1(t)$

will be imposed where again $k$, $\eta$ and $\alpha$ are functions of x, y, and t. Since it is
intended to apply invariant imbedding for each locally one dimensional problem
the specification of the data for the free boundary $\partial D_2(t)$ can be kept quite
general. Indeed, as indicated in [3], the boundary conditions on $\partial D_2(t)$ are
important only in so far as they lead to an orderly evolution of $\partial D_2(t)$ obeying
the above hypothesis. Accordingly, we shall write


(2.4)  $u = g_0(x,y,t)$

(2.5)  $k\nabla u = \left(g_1(x,y,\tfrac{ds_1}{dt},t),\; g_2(x,y,\tfrac{ds_2}{dt},t)\right)$,  $(x,y) \in \partial D_2(t)$


(A  specific  example  of  a heat  flow  problem  with  space  dependent  energy  input  on 
the  free  surface  is  presented  below  to  illustrate  the  meaning  of  these  conditions.) 

In the method of fractional steps the equation (2.1) is written as

(2.6)  $Lu = L_1u + L_2u = f$

where

$L_1u = (ku_x)_x + a_1u_x + \tfrac12 bu - \tfrac12 cu_t$

and

$L_2u = (ku_y)_y + a_2u_y + \tfrac12 bu - \tfrac12 cu_t$,

and its solution is advanced over a discrete time step of length $\Delta t$ by integrating
each one dimensional equation $L_1u = \tfrac12 f$ and $L_2u = \tfrac12 f$ over successive half steps.
Specifically, suppose that $u = u(x,y,t_n)$ is known over the given domain $D(t_n)$.
Then the solution u and domain $D(t_{n+1/2})$ at the time level $t = t_n + \tfrac12\Delta t \equiv t_{n+1/2}$
are found from the one dimensional problem


(2.7)  $L_1u = \tfrac12 f$,  $(x,y) \in D(t)$,  $t \in (t_n, t_{n+1/2}]$

       $k(x,y,t)\,\partial u/\partial x = \eta(x,y,t)u + \alpha(x,y,t)$,  $(x,y) \in \partial D_1(t)$

       $u = g_0(x,y,t)$,  $(x,y) \in \partial D_2(t)$

       $ku_x = g_1(x,y,\tfrac{ds_1}{dt},t)$

with $u = u(x,y,t_n)$ as given on the known domain $D(t_n)$. In order to solve (2.7)
the free boundary is given as $x = s_1(y,t)$. Throughout these equations the
variable y is considered a parameter. (Numerically, of course, this one
dimensional problem must be solved for representative values of y.)

Once $u(x,y,t_{n+1/2})$ and $D(t_{n+1/2})$ have been found they serve as initial values
for the second step in which the evolution of u and $D(t)$ over $(t_{n+1/2}, t_{n+1}]$ is
completed by solving

(2.8)  $L_2u = \tfrac12 f$,  $(x,y) \in D(t)$,  $t \in (t_{n+1/2}, t_{n+1}]$

       $k(x,y,t)\,\partial u/\partial y = \eta(x,y,t)u + \alpha(x,y,t)$,  $(x,y) \in \partial D_1(t)$

       $u = g_0(x,y,t)$,  $(x,y) \in \partial D_2(t)$

       $ku_y = g_2(x,y,\tfrac{ds_2}{dt},t)$,

again with $u(x,y,t_{n+1/2})$ as computed on $D(t_{n+1/2})$.

Note that if in (2.7) and (2.8) the lines of constant y and x do not
intersect the free boundary $\partial D_2(t)$ then the one dimensional operators $L_1u$
and $L_2u$ are subject to standard flux two point boundary conditions. Thus
the above formulation includes standard diffusion problems on fixed domains.
Alternately, one may visualize $\partial D_2(t)$ as given and the two conditions (2.4,5)
replaced by a single nonlinear relation between $u$ and $\nabla u$. In this way nonlinear
boundary data on selected portions of the boundary can be accommodated. We shall
not pursue this point here.

An analysis of the analytical fractional step method (2.7,8) as $\Delta t \to 0$ for
general free surface problems is not available at this time. However, the
extensive numerical experiments with Stefan like problems reported in [3] indicate
that even a fairly crude numerical integration of the one dimensional equations
(2.7) and (2.8) for discrete values of the parameters y and x can adequately
reproduce a-priori known solutions. Therefore, always under the hypothesis of
invertibility of the free surface, the following algorithm can be suggested with
confidence for the numerical solution of (2.7,8).

Let X and Y be upper bounds on the position of the free surface along the
x and y axes so that $D(t)$ will always be contained in the rectangle $[0,X] \times [0,Y]$.
On this rectangle we shall place mesh lines $x = x_i$, $i = 0,\cdots,M$ and $y = y_j$,
$j = 0,\cdots,N$. We then shall solve (2.7) along those lines $y = y_j$ which can be
expected to intersect $D(t)$, and (2.8) along the lines $x = x_i$ which also intersect
$D(t)$. Let us consider in detail the numerical solution of (2.7).

It is assumed that at the initial time $t = t_n$ the solution u is known at
the mesh points $\{(x_i,y_j)\}$ of the grid on $[0,X] \times [0,Y]$. It is also assumed
that the free boundary $x = s_1(y,t_n)$ is known along each line $y = y_j$ belonging
to $D(t_n)$. However, it need not coincide with a mesh point. Along any line
$y = y_j$ the solution $u(x,y_j,t)$ of (2.7) is advanced over the fractional time
step $\Delta t/2$ in one numerical time step of equal length. (Conceivably, if the
diffusion takes place primarily in one direction, an unequal weighting of the
fractional steps and a refined time step for the numerical integration in the
dominant direction may be preferable. So far, no such asymmetry in the method
has been used.) Invariant imbedding will be applied to the time implicit
approximation of (2.7) at $t_{n+1/2}$. We shall write (2.7) in first order form as
(2.9)  $ku' = v$

       $v' = -\dfrac{a_1}{k}v - \tfrac12 bu + \tfrac12 c\,\dfrac{u-u_n}{\Delta t/2} + \tfrac12 f$

where it is understood that all functions depend on the independent variable x
and the parameters $y_j$ and $t_{n+1/2}$. The appropriate boundary conditions are

(2.10)  $v(0) = \eta(0,y_j,t_{n+1/2})u(0) + \alpha(0,y_j,t_{n+1/2})$

and, if the line $y = y_j$ intersects the free boundary $\partial D_2(t)$,

(2.11a)  $u(s_1) = g_0(s_1,y_j,t_{n+1/2})$

(2.11b)  $v(s_1) = g_1\left(s_1,y_j,\dfrac{s_1-s_1(y_j,t_n)}{\Delta t/2},t_{n+1/2}\right)$

where $s_1 = s_1(y_j,t_{n+1/2})$ is the unknown location of the free boundary $\partial D_2(t_{n+1/2})$ on
the line $y = y_j$. If this line does not intersect $\partial D_2(t_{n+1/2})$ then these two
equations are replaced by

(2.11c)  $-v(X) = \eta(X,y_j,t_{n+1/2})u(X) + \alpha(X,y_j,t_{n+1/2})$.

It may be noted that the equations (2.9-11) have to be solved along each line
$y = y_j$, $j = 1,\cdots,N$ but that they are uncoupled.

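According to the invariant imbedding formalism we write

(2.12)  $v = R(x)u + w(x)$

where R and w are found from the initial value problems

(2.13a)  $R' = -\tfrac12 b + \dfrac{c}{\Delta t} - \dfrac{a_1}{k}R - \dfrac{1}{k}R^2$,  $R(0) = \eta(0,y_j,t_{n+1/2})$

(2.13b)  $w' = -\left[\dfrac{a_1}{k} + \dfrac{1}{k}R\right]w - \dfrac{c}{\Delta t}u_n + \tfrac12 f$,  $w(0) = \alpha(0,y_j,t_{n+1/2})$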
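The initial value problems (2.13) follow from (2.9) and (2.12) by a short calculation. The sketch below assumes that (2.9) reads $ku' = v$, $v' = -(a_1/k)v - \tfrac12 bu + c(u-u_n)/\Delta t + \tfrac12 f$, and differentiates $v = Ru + w$:

```latex
\begin{align*}
v' &= R'u + Ru' + w' = R'u + \frac{R}{k}\,(Ru + w) + w', \\
v' &= -\frac{a_1}{k}\,(Ru + w) + \Bigl(-\tfrac12 b + \frac{c}{\Delta t}\Bigr)u
      - \frac{c}{\Delta t}\,u_n + \tfrac12 f .
\end{align*}
% Equating the coefficients of u, and separately the terms free of u:
\begin{align*}
R' &= -\tfrac12 b + \frac{c}{\Delta t} - \frac{a_1}{k}\,R - \frac{1}{k}\,R^2, \\
w' &= -\Bigl(\frac{a_1}{k} + \frac{R}{k}\Bigr)w - \frac{c}{\Delta t}\,u_n + \tfrac12 f .
\end{align*}
```

The initial values $R(0) = \eta$ and $w(0) = \alpha$ are obtained by comparing (2.12) at $x = 0$ with (2.10).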


Since $u_n$ typically is only known at the mesh points $x_i$, $i = 0,\cdots,M$, along the
line $y = y_j$ an integration of (2.13) may require an interpolation of $u_n$ between
the mesh points. If the trapezoidal rule is used the integration of (2.13a and b)
can be carried out by formula without the need for interpolation. This route is
followed in the numerical computation below. In order to place the free boundary
$s_1$ we observe from (2.12) and (2.11a and b) that $s_1$ must be chosen so that the
relation

$g_1\left(s_1,y_j,\dfrac{s_1-s_1(y_j,t_n)}{\Delta t/2},t_{n+1/2}\right) = R(s_1)g_0(s_1,y_j,t_{n+1/2}) + w(s_1)$

holds. Hence, as (2.13) is integrated we monitor the function

(2.14)  $\phi(x) \equiv R(x)g_0(x,y_j,t_{n+1/2}) + w(x) - g_1\left(x,y_j,\dfrac{x-s_1(y_j,t_n)}{\Delta t/2},t_{n+1/2}\right)$

and determine the root of

(2.15)  $\phi(x) = 0$

by linear interpolation between successive x mesh points between which $\phi$ changes
sign for the first time. The validity of this procedure for Stefan problems is
proven in [2]. If no such root is found on $[0,X]$ then this line does not
intersect the free boundary and we can set $s_1 = X$. Finally, the solution u at
$t = t_{n+1/2}$ along the line $y = y_j$ is formed by integrating backward from $s_1$ to 0
the equation

(2.16)  $u' = \dfrac{R(x)u + w(x)}{k(x,y_j,t_{n+1/2})}$

subject to the initial condition

$u(s_1) = g_0(s_1,y_j,t_{n+1/2})$  if $s_1 < X$

or

$u(X) = -\dfrac{w(X) + \alpha(X,y_j,t_{n+1/2})}{R(X) + \eta(X,y_j,t_{n+1/2})}$  if $s_1 = X$

where the last condition is obtained by substituting (2.12) into (2.11c). Again,
the trapezoidal rule can be used without interpolation. In general, a fractional
step will be used to go from $s_1$ to the first regular x mesh point to the left of
$s_1$.
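The sweep just described (forward integration of R and w, placement of the front at the first sign change of the monitored function, backward integration of u) can be sketched as follows. This is only an illustration under simplifying assumptions that are not the paper's: constant coefficients, a classical Runge-Kutta integrator with linear interpolation in place of the trapezoidal rule, and a hypothetical helper name `half_step` with hypothetical default data `g0 = 0`, `g1 = -speed`.

```python
import numpy as np

def half_step(x, un, s_old, dt, k=1.0, a1=0.0, b=0.0, c=1.0, f=0.0,
              eta=0.0, alpha0=0.0, alphaX=0.0,
              g0=lambda x: 0.0, g1=lambda x, speed: -speed):
    """Advance u along one line y = y_j over a half step (hypothetical helper).

    x: uniform mesh on [0, X]; un: previous solution on x; s_old: old front.
    Returns (s, u): new front position and u on the mesh (NaN right of the front).
    """
    h = x[1] - x[0]
    q = -0.5 * b + c / dt                          # coefficient of u in v'

    def rhs(xx, R, w):                             # right sides of (2.13a), (2.13b)
        src = -(c / dt) * np.interp(xx, x, un) + 0.5 * f
        return q - (a1 / k) * R - R * R / k, -((a1 + R) / k) * w + src

    R, w = np.empty_like(x), np.empty_like(x)
    R[0], w[0] = eta, alpha0
    for i in range(len(x) - 1):                    # forward sweep (RK4)
        k1 = rhs(x[i], R[i], w[i])
        k2 = rhs(x[i] + h / 2, R[i] + h / 2 * k1[0], w[i] + h / 2 * k1[1])
        k3 = rhs(x[i] + h / 2, R[i] + h / 2 * k2[0], w[i] + h / 2 * k2[1])
        k4 = rhs(x[i] + h, R[i] + h * k3[0], w[i] + h * k3[1])
        R[i + 1] = R[i] + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        w[i + 1] = w[i] + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])

    speed = (x - s_old) / (dt / 2)                 # front velocity if the front were at x
    phi = np.array([R[i] * g0(x[i]) + w[i] - g1(x[i], speed[i])
                    for i in range(len(x))])       # monitored function (2.14)
    hits = np.nonzero(phi[:-1] * phi[1:] < 0.0)[0]
    if hits.size:                                  # first sign change: interpolate the root
        i = hits[0]
        s = x[i] - phi[i] * h / (phi[i + 1] - phi[i])
        u_s = g0(s)
    else:                                          # no front on this line: use (2.11c)
        s = x[-1]
        u_s = -(w[-1] + alphaX) / (R[-1] + eta)

    def du(xx, uu):                                # backward equation u' = (R u + w) / k
        return (np.interp(xx, x, R) * uu + np.interp(xx, x, w)) / k

    u = np.full_like(x, np.nan)
    last = np.searchsorted(x, s, side="right") - 1  # last mesh index with x <= s
    xc, uc = s, u_s
    for i in range(last, -1, -1):                  # first step is the fractional one
        step = x[i] - xc
        if step != 0.0:
            k1 = du(xc, uc)
            k2 = du(xc + step / 2, uc + step / 2 * k1)
            k3 = du(xc + step / 2, uc + step / 2 * k2)
            k4 = du(xc + step, uc + step * k3)
            uc = uc + step / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        xc, u[i] = x[i], uc
    return s, u
```

With `c = 0` the problem reduces to a steady two point problem whose solution is linear in x, which gives a convenient check of both the fixed boundary branch and the front placement.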

In order to integrate (2.8) along discrete lines $x = x_i$ over the time
interval $[t_{n+1/2},t_{n+1}]$ we proceed as above. An implicit time approximation is
used to convert (2.8) into a free boundary problem for a second order linear
ordinary differential equation. It is written as an equivalent first order
system which looks like (2.9) except that $a_1$ is replaced by $a_2$ and $u_n$ is
replaced by $u_{n+1/2}$. All functions are now evaluated at $x_i$ and $t_{n+1}$ and depend
on the independent variable y. The equations (2.10) and (2.11) are replaced
by

$v(0) = \eta(x_i,0,t_{n+1})u(0) + \alpha(x_i,0,t_{n+1})$

$u(s_2) = g_0(x_i,s_2,t_{n+1})$

$v(s_2) = g_2\left(x_i,s_2,\dfrac{s_2-s_2(x_i,t_{n+1/2})}{\Delta t/2},t_{n+1}\right)$

or

$-v(Y) = \eta(x_i,Y,t_{n+1})u(Y) + \alpha(x_i,Y,t_{n+1})$

whenever the line $x = x_i$ does not intersect $\partial D_2(t)$.

If $u_{n+1/2}$ is needed outside $D(t_{n+1/2})$ to integrate (2.13b) it is extended linearly
beyond the free boundary. A slight difficulty now arises. At time $t = t_{n+1/2}$
the free boundary is given as $x = s_1(y_j,t_{n+1/2})$ along each y line whereas
now the inverse function $s_2$ is required. In our numerical method $y = s_2(x_i,t_{n+1/2})$
is obtained again by linear interpolation between successive y mesh points
between which the expression $u(x_i,y,t_{n+1/2}) - g_0(x_i,y,t_{n+1/2})$ changes sign. For
simple expressions of $g_0$ such as $g_0 = a$ the validity of such placement of the
free boundary is easy to demonstrate. After $\partial D_2(t_{n+1})$ is found the inverse
function $x = s_1(y_j,t_{n+1})$ of $y = s_2(x_i,t_{n+1})$ is found by interpolating and the
integration over the next time step can begin.
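The final inversion step, recovering one parametrization of the front from samples of the other, can be sketched with plain linear interpolation. The helper name `invert_front` is hypothetical, and the sketch assumes the sampled front is strictly monotone, as the hypothesis of this section implies.

```python
import numpy as np

def invert_front(y, s1, x_query):
    """Return y = s2(x_query) from samples x = s1(y_j) by linear interpolation.

    Assumes s1 is strictly monotone in y (no horizontal or vertical tangents).
    Values of x_query outside the sampled range are clamped by np.interp.
    """
    order = np.argsort(s1)              # np.interp needs increasing abscissae
    return np.interp(x_query, s1[order], y[order])
```

For a straight front `s1 = 1 - 0.5 * y` the inversion is exact, so `invert_front(y, s1, 0.75)` should return 0.5.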


3. THE METHOD OF LINES FOR FREE SURFACE PROBLEMS. In many applications
the free surface cannot conveniently be expressed as an invertible curve with
respect to two mutually orthogonal axes. This is the case, for example,
whenever the movement takes place primarily along one of the coordinate axes.
On the basis of the numerical experiments reported in [4], [5] it can be
concluded that the method of lines discretization coupled with an iterative
solution of the resulting multipoint free surface problem will also lead to a
transparent and flexible numerical method. For definiteness let us again
consider problem (2.1-5) with the understanding that the motion of the free
boundary takes place along the x-axis. The hypothesis of section 2 can now be
replaced by a weaker requirement.

Hypothesis: The free boundary has no horizontal tangents.

Under this condition the boundary $\partial D_2(t)$ can be expressed as

$x = s_1(y,t)$.

In the method of lines algorithm discretizing y and replacing all derivatives
with respect to y by difference quotients is suggested. Again, working on the
square $[0,X] \times [0,Y]$ let $0 = y_0 < y_1 < \cdots < y_N = Y$ define a partition which
for convenience is chosen equidistant. Along the lines of constant y the
equations (2.1-5) then are replaced by a system of coupled one dimensional
equations. Specifically, along the line $y = y_j$ we shall solve


(3.1)  $L_ju \equiv (ku_x)_x + a_1u_x + \left(b - \dfrac{k_{j+1/2}+k_{j-1/2}}{\Delta y^2}\right)u - cu_t = F(x,y_j,u_{j-1},u_{j+1},t)$

where

$F(x,y_j,u_{j-1},u_{j+1},t) = f - \dfrac{k_{j-1/2}u_{j-1} + k_{j+1/2}u_{j+1}}{\Delta y^2} - a_2\dfrac{u_{j+1}-u_{j-1}}{2\Delta y}$,

where $k_{j+1/2} = (k(x,y_{j+1},t) + k(x,y_j,t))/2$,
and where all other data functions are evaluated at $(x,y_j,t)$. The approximation
of $a_2u_y$ by a central difference quotient was chosen for convenience only. Maximum
principle arguments advanced for one sided quotients in finite difference methods
may apply here as well, but computational experience still is lacking. The
initial and boundary conditions for (3.1) are


(3.2)  $u(x,y_j,0) = u_0(x,y_j)$

(3.3)  $k(0,y_j,t)\,\partial u/\partial x = \eta(0,y_j,t)u + \alpha(0,y_j,t)$

(3.4)  $u(s_1,y_j,t) = g_0(s_1,y_j,t)$

(3.5)  $k(s_1,y_j,t)u_x = g_1(s_1,y_j,\tfrac{ds_1}{dt},t)$

if the line $y = y_j$ intersects $\partial D_2(t)$, otherwise the two conditions (3.4,5) are
replaced by

$-k(X,y_j,t)\,\partial u/\partial x = \eta(X,y_j,t)u(X) + \alpha(X,y_j,t)$.


We  see  that  formally  this  system  is  identical  with  (2.7)  except  that  now  a 
coupling  to  the  solution  along  neighboring  lines  exists  through  the  source 
term  F.  We  remark  that  fractional  step  methods  based  on  splittings  analogous 
to  (3.1)  exist  where  the  coupling  then  is  removed  through  time  lagging  [8]. 
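The coupling between neighboring lines can be made concrete with a small sketch that evaluates, at one point x on the line $y = y_j$, the coefficient subtracted from b in (3.1) and the source term F. The function name `line_coupling` and its argument layout are hypothetical; the formulas follow (3.1) as written above.

```python
import numpy as np

def line_coupling(u_below, u_above, k_below, k_here, k_above, a2, f, dy):
    """Return (shift, F) for the line y = y_j at a fixed x.

    shift is (k_{j+1/2} + k_{j-1/2}) / dy**2, the term subtracted from b;
    F is the source term coupling the line to its neighbors j-1 and j+1.
    """
    k_minus = 0.5 * (k_below + k_here)      # k_{j-1/2}
    k_plus = 0.5 * (k_here + k_above)       # k_{j+1/2}
    shift = (k_plus + k_minus) / dy**2
    F = (f - (k_minus * u_below + k_plus * u_above) / dy**2
           - a2 * (u_above - u_below) / (2.0 * dy))
    return shift, F
```

As a consistency check, `f - F - shift * u_here` is the difference approximation of $(ku_y)_y + a_2u_y$; for $k = 1$ and $u = y^2$ it reproduces $2 + 2a_2y_j$ exactly.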

Here we intend to solve the fully coupled system through an SOR type iteration.
Specifically, we shall solve over discrete time steps of length $\Delta t$ and starting
with a given initial condition $u(x,y,t_n)$ on a known domain $D(t_n)$ the following
sequence

(3.6)  $L_j\bar u = F(x,y_j,u^m_{j-1},u^{m-1}_{j+1},t)$,  $t \in (t_n,t_n+\Delta t]$,  $j = 0,\cdots,N$,  $m = 1,2,\cdots$

       $u^m_j = u^{m-1}_j + \omega[\bar u - u^{m-1}_j]$

where $\bar u$ satisfies the boundary conditions (3.2-5) and the initial condition

$\bar u(x,y_j,t_n) = u(x,y_j,t_n)$.

For the initial guess we choose

$u^0(x,y_j,t) = u(x,y_j,t_n)$

since this initial condition appears to be of little importance to the calculation.
The final solution $u(x,y_j,t)$ for $t \in (t_n,t_n+\Delta t]$ is obtained as

$u(x,y_j,t) = \lim_{m\to\infty} u^m(x,y_j,t)$.
Computational evidence suggests the existence of this limit for Stefan like free
boundary problems. A mathematical justification so far has been given only for
certain nonlinear fixed domain problems [6].
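The structure of the iteration (3.6) can be illustrated on a much simpler stand-in problem. The sketch below assumes a steady, fixed domain analogue (Laplace's equation on the unit square with Dirichlet data $u = x + y$) rather than the free boundary problem of the paper: each line $y = y_j$ is solved exactly in x using the latest values on line $j-1$ and the old values on line $j+1$, and the result is relaxed with the factor $\omega$. The function name `line_sor` is hypothetical.

```python
import numpy as np

def line_sor(n=10, omega=1.4, sweeps=60):
    """Line SOR for u_xx + u_yy = 0 on the unit square; returns the max error."""
    h = 1.0 / n
    xs = np.linspace(0.0, 1.0, n + 1)
    X, Y = np.meshgrid(xs, xs, indexing="ij")    # u[i, j] ~ u(x_i, y_j)
    exact = X + Y                                # harmonic, reproduced exactly by the scheme
    u = np.zeros_like(exact)
    u[0, :], u[-1, :] = exact[0, :], exact[-1, :]
    u[:, 0], u[:, -1] = exact[:, 0], exact[:, -1]
    m = n - 1
    # one dimensional operator on a line: u_xx - (2 / h^2) u
    A = (np.diag(np.full(m, -4.0)) + np.diag(np.ones(m - 1), 1)
         + np.diag(np.ones(m - 1), -1)) / h**2
    for _ in range(sweeps):
        for j in range(1, n):
            # source term: new values on line j-1, old values on line j+1
            rhs = -(u[1:-1, j - 1] + u[1:-1, j + 1]) / h**2
            rhs[0] -= u[0, j] / h**2             # known boundary values in x
            rhs[-1] -= u[-1, j] / h**2
            ubar = np.linalg.solve(A, rhs)       # solve along the line
            u[1:-1, j] += omega * (ubar - u[1:-1, j])   # relaxation step
    return np.max(np.abs(u - exact))
```

On this model problem the iteration converges quickly for a relaxation factor near 1.4; no such statement is implied for the free boundary case.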


The numerical solution of (3.6) proceeds exactly like that of (2.7). The
solution $(u,s_1)$ along the line $y = y_j$ is advanced over one time step with the
equations (2.12; 2.13a,b; 2.15 and 2.16) except that the following replacements
are made:

$\tfrac12 b \to \left(b - \dfrac{k_{j+1/2}+k_{j-1/2}}{\Delta y^2}\right)$,  $\tfrac12 f \to F(x,y_j,u^m_{j-1},u^{m-1}_{j+1},t_{n+1})$

and

$\dfrac{x-s_1(y_j,t_n)}{\Delta t/2} \to \dfrac{x-s_1(y_j,t_n)}{\Delta t}$.


4. A NUMERICAL EXAMPLE. The existing computer codes of [3] and [4] for
diffusion problems with Dirichlet data (rather than flux data) on the fixed boundary
were modified to solve numerically the following two dimensional problem


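(4.1)  $\Delta u - u_t = 0$,  $(x,y) \in D(t)$,  $t > 0$

(4.2)  $u = u_0(x,y)$,  $(x,y) \in D(0)$

(4.3)  $u = \alpha(x,y,t)$,  $(x,y) \in \partial D_1(t) \cap [\{x=0\} \cup \{y=0\}]$

(4.3')  $u = 0$,  $(x,y) \in \partial D_1(t) \cap [\{x=1\} \cup \{y=1\}]$

(4.4)  $u = 0$,  $(x,y) \in \partial D_2(t)$

(4.5)  $\nabla u = -\left(g_1(x,y) + \tfrac{ds_1}{dt},\; g_2(x,y) + \tfrac{ds_2}{dt}\right)$,  $(x,y) \in \partial D_2(t)$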


where $D(0) = [0,1] \times [0,1] - \{(x,y): (x-1)^2 + (y-1)^2 < r^2\}$. The free boundary
$\partial D_2(0)$ was the portion of the circle of radius $r$ lying in the unit square.
The functions $g_1$ and $g_2$ were chosen to qualitatively model the ablation of a
material occupying $D(t)$ and maintained at a given temperature $\alpha$ on the axes due
to radiative input from a source at x = 1, y = 1. Specifically, we chose the
source terms

$g_1 = (1-x)/((1-x)^2 + (1-y)^2)^{3/2}$,  $g_2 = (1-y)/((1-x)^2 + (1-y)^2)^{3/2}$

to denote an energy input on the free surface proportional to the inverse of the
distance to the source times the cosine of the angle of incidence between the
direction to the source and the respective coordinate axes.
In Fig. 1 we display the progress of the free boundary $\partial D_2(t)$ for the
special boundary and initial conditions $\alpha = 1$ and $u_0 = 1$ on $D(0)$, $u_0 = 0$
otherwise. The solution is now expected to be symmetric with respect to
the line y = x. The code of [3] could handle the boundaries quite effectively.
Going from a 40×40 grid on the unit square and a time step of $\Delta t = 0.2/100$ to
the much coarser 20×20 grid and time step $\Delta t = 0.2/50$ resulted in less than 4%
change in the position of the free surface while further refinements (up to
100×100 grids) left the position unchanged. In Figs. 1a and 1b we show the
free surface every ten time steps on the 40×40 and 20×20 grids. The irregular
behavior near y = 1 should be ignored because the free surface $y = s_2(x,t)$ is
plotted for fixed x values which in general do not coincide with the point of
intersection between $s_2(x,t)$ and y = 1. The actual free boundaries are quite
symmetric. For example, the computed values for the intersection with x = 1
and y = 1 in Fig. 1a are (1, .4411) and (.4483, 1) at t = 0.2. It is apparent
from these results that the fractional step method could cope quite well with
the free boundaries observed here. On the basis of the numerical experiments
in [3] this was expected. Some comments on the run times: 50 time steps with
the 20×20 grid required 22 secs, 100 time steps with the 40×40 grid required
178 secs, and 40 time steps with a 100×100 grid required 329 secs which amounts
to roughly $10^{-3}$ secs per mesh point per time step.
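The quoted cost per mesh point per time step can be checked directly from the three reported runs. The sketch assumes that an n×n grid is counted as n·n mesh points; the function name is hypothetical.

```python
# (time steps, n for an n-by-n grid, reported seconds), taken from the text above
runs = [(50, 20, 22.0), (100, 40, 178.0), (40, 100, 329.0)]

def seconds_per_point_step(steps, n, seconds):
    # assumption: the number of mesh points is counted as n * n
    return seconds / (steps * n * n)

rates = [seconds_per_point_step(*run) for run in runs]
```

All three rates come out close to one millisecond per mesh point per time step, so the cost grows essentially linearly with the number of mesh points.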


The method of lines approach (with y taken as the continuous variable),
on the other hand, could not cope adequately with this problem. Admittedly,
the asymmetry of the boundary conditions and the vertical tangent of the free
surfaces at y = 1 made this method less attractive than the fractional step
method. However, the method had worked quite well in the sample problems of
[4] and [5] so that its poor performance here came as somewhat of a surprise.

A representative result, somewhat analogous to that of Fig. 1a, is shown in
Fig. 2. Again, the shape of the free boundary deteriorates near y = 1 because
its intersection with y = 1 is set to the nearest mesh point on the left. Because
of the coarse x-grid, the distortion is quite pronounced; however, the slow
convergence associated with a more evenly spaced grid and the enormously long
execution times present for large grids precluded an effective refinement of
the above grid. A comparison between the answers of Fig. 1a and Fig. 2 shows
discrepancies of the order of 12% on the line x = 1 and of 21% on the line
y = 1 at t = 0.2.


These results indicate that the method of fractional steps is considerably
faster and in certain geometries more accurate than the method of lines as
presently implemented. Since the latter method applies under considerably
weaker hypotheses the challenge is out to improve the method of lines, possibly
through a finite element space discretization, to make it competitive with the
fractional step method.
Fig. 2. Free surfaces after every 10 time steps for the ablation
problem (4.1-5). Method of lines solution for $\Delta x = 1/20$,
$\Delta y = 1/40$, $\Delta t = 0.2/100$. For a relaxation factor of $\omega = 1.4$
about 10 SOR cycles were required for each time step. The
results were asymmetric and poor compared to the fractional
step results. Computing time was an enormous 980 secs.
The method of lines in its present implementation did not
solve this problem satisfactorily.
ON THE NUMERICAL SOLUTION OF THE
POISSON EQUATION BY THE CAPACITANCE MATRIX METHOD

ABSTRACT. A discrete potential theory is developed for the capacitance
matrix method that enables us to prove rapid convergence of conjugate gradient
iterations in solving the capacitance matrix equation. An operation count of
constant times $n^2 \log n$ can be attained for second order accuracy schemes of the
Neumann problem and high order accuracy schemes of the Dirichlet problem.

1. INTRODUCTION. The purpose of this paper is to present some theoretical
and experimental aspects of efficient solvers for the linear systems of equations
arising from finite difference discretization of the Dirichlet or the Neumann
problem of the Poisson equation. The operation count of our algorithm is
proportional to $n^2 \log n$, where $n = 1/h$. It is shown that high order accuracy
can be achieved for the Dirichlet problem and second order accuracy can be
achieved for the Neumann problem. The algorithm given here can be easily extended
to the Helmholtz equation and to three dimensional problems.

2. CERTAIN RESULTS OF CLASSICAL POTENTIAL THEORY. We give here a
summary of certain classical results that are useful for our purposes.

Let $\mathcal{V}$ denote the potential resulting from a single layer charge
distribution $\rho$ on a smooth boundary curve $\partial\Omega$,

$\mathcal{V}(x) = (1/\pi)\int_{\partial\Omega} \rho(\xi)\,\log r\,ds(\xi)$.

Here $x = (x_1,x_2)$, $\xi = (\xi_1,\xi_2)$ and $r^2 = (x_1-\xi_1)^2 + (x_2-\xi_2)^2$. Similarly, the
potential $\mathcal{W}$ of a dipole density $\mu$ on $\partial\Omega$ is defined by

$\mathcal{W}(x) = (1/\pi)\int_{\partial\Omega} \mu(\xi)\,(\partial/\partial\nu_\xi)\log r\,ds(\xi)$.

We adopt the convention that the normal direction is toward the exterior
of the region $\Omega$ in which we want to solve our problem. It can be shown (see
e.g. [1] or [2]) that the Neumann and Dirichlet problems of Poisson's equation
can be reduced to Fredholm integral equations of the second kind. For the
interior Dirichlet problem, we make the Ansatz

$u(x) = \mathcal{W}(x)$

for the solution of

(2.1)  $\Delta u = 0$,  $x \in \Omega$
       $u = g$,  $x \in \partial\Omega$.
The boundary condition is satisfied by choosing $\mu$ such that

$\mu + (1/\pi)\int_{\partial\Omega} \mu\,(\partial/\partial\nu_\xi)\log r\,ds(\xi) = g$.

This equation can be written as

(2.2)  $(I + K)\mu = g$

where K is the compact integral operator defined by the integral above.
For the interior Neumann problem, we make the Ansatz

$u(x) = \mathcal{V}(x)$

for the solution of

(2.3)  $\Delta u = 0$,  $x \in \Omega$
       $\partial u/\partial\nu = g$,  $x \in \partial\Omega$.

The boundary condition is satisfied by choosing $\rho$ such that

(2.4)  $(I - K)\rho = g$,

where K is the same as in (2.2).

It is known that (2.2) has a unique solution for each g and the solution
of (2.4) is unique up to an additive constant.

3. THE CAPACITANCE MATRIX METHOD. We are concerned with solving (2.1)
or (2.3) by the finite difference method. Let $R_h$ denote the uniform mesh on
the plane.

Let $\Omega_h = \Omega \cap R_h$. Let A denote the coefficient matrix of our discrete
problem on $\Omega_h$. Let G denote the discrete analog of log r with respect to
the five point formula for the discrete Laplacian. Let B denote the coefficient
matrix for the discrete Laplacian; $BGv = v$ and $Gv$ is well defined for any
n × n vector v. We extend the domain of A to $R_h$ by adjoining the rows
of B on $R_h \setminus \Omega_h$. We extend the domain of ? to A. The new matrix A
thus formed differs from B by
only m rows, where m = the number of irregular mesh points near the boundary.
Note that $m \sim 1/h$. We can therefore write

$A = B + UW^T$

where U is the $n^2 \times m$ extension by zero matrix and $W^T$ is an $m \times n^2$ matrix.
U retains the value of any vector defined only on the irregular mesh points and
extends it by zero to all mesh points. $W^T$ describes how the boundary conditions
are approximated. The restriction of the solution u of (3.1)
to $\Omega_h$ is the solution for our original discrete problem provided that A is a
reducible matrix with no coupling from $\Omega_h$ to $R_h \setminus \Omega_h$.


We now describe our algorithm for solving (3.1). In the Dirichlet case, we
make the double layer Ansatz

$u = Gv + GV\mu$.

The m-vector $\mu$ is determined by solving

(3.2)  $(I_m + W^TGV)\mu = -W^TGv$.

The $n^2 \times m$ matrix V transforms $\mu$ into a double layer charge distribution.
Each column of V contains a discrete dipole of unit strength in the normal
direction. See e.g. section 3 of [3] for details.

In the Neumann case, we make the single layer Ansatz

$u = Gv + GU\rho$.

The m-vector $\rho$ should satisfy the equation

(3.3)  $(I_m + W^TGU)\rho = -W^TGv$.

Both equations (3.2) and (3.3) are called capacitance matrix equations.
The matrix on the left hand side of (3.2) or (3.3) is called the capacitance
matrix which we denote by C. See also [4] for details.
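The algebra behind the single layer Ansatz can be verified on a small dense example. The sketch below rests on assumptions that are not in the text: the system being solved is taken to be $Au = v$ with $A = B + UW^T$, G is formed as the explicit inverse of a small stand-in matrix B (in the paper G is applied by a fast solver), and the function name `capacitance_solve` is hypothetical.

```python
import numpy as np

def capacitance_solve(B, U, W, v):
    """Solve (B + U W^T) u = v through the m-by-m capacitance matrix equation."""
    G = np.linalg.inv(B)                        # dense stand-in for the fast solver G
    C = np.eye(U.shape[1]) + W.T @ G @ U        # capacitance matrix of (3.3)
    rho = np.linalg.solve(C, -W.T @ G @ v)      # capacitance matrix equation
    return G @ v + G @ U @ rho                  # single layer Ansatz u = Gv + GU rho
```

Multiplying out $(B + UW^T)(Gv + GU\rho)$ gives $v + U[(I_m + W^TGU)\rho + W^TGv]$, which equals v exactly when $\rho$ satisfies (3.3); only an m×m system has to be solved beyond applications of G.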

We now describe our method of constructing $W^T$. In the Dirichlet case,
we use the following schemes for interpolating the boundary conditions.

The first scheme which we shall call scheme I is as follows. Let
$P \in \partial\Omega_h$, the set of irregular mesh points. Suppose that the local orientation
of boundary is such that either W, the western neighbour of P, is in $R_h \setminus \Omega_h$
or both W and N, the northern neighbour of P, are in $R_h \setminus \Omega_h$. We form the
equation

$4u(P) - u(E) - u(S) = h^2f(P) + 2g(P)$.

In the second scheme, called scheme II, we use an interpolation formula of second
order accuracy for the Dirichlet data and obtain the equation

$4u(P) - (1-\alpha_1)u(E) - (1-\alpha_2)u(S) = h^2f(P) + (\alpha_3+\alpha_4)g(P)$

where $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ are suitable Lagrangian coefficients. In the third
scheme, which we shall call scheme H, we use a sixth order interpolation formula
to interpolate the Dirichlet data. The equation at P will typically be in the
form

$4u(P) - \sum_{i=1}^{5}\beta_iu(E_i) - \sum_{i=1}^{5}\gamma_iu(S_i) = h^2f(P) + \sum_{i=1}^{10}\delta_ig(P)$,

where $E_i$, $S_i$ are suitably chosen mesh points very close to P and $\beta_i$, $\gamma_i$ and
$\delta_i$ arise from the Lagrangian coefficients. See e.g. [5] for details.
The last scheme is a highly artificial scheme not intended for practical
use. It is, however, indispensable for theoretical purposes. Typically we form
the equation

       $(5 + \alpha)\,u(P) - u(E) - u(S) - u(H) - \alpha\,u(P') = h^2 f(P) + 2g(P)$ .

The parameter $\alpha$ is to be determined as described in section 5. P' is a suitably
chosen point sufficiently close to P. We shall refer to this scheme as scheme X.

In the Neumann case, we use the following schemes. The first scheme, which
we refer to as scheme I.N, is constructed as follows. Let $\alpha$ be the angle that
the normal through P intersecting the boundary at P* makes with the x-axis.
We approximate $[\partial u/\partial\nu](P^*)$ by

       $[u(W) - (1-\tan\alpha)\,u(P) - (\tan\alpha)\,u(S)] / (h \sec\alpha)$ .

We then eliminate u(W) from the five-point formula to obtain

       $(3 + \tan\alpha)\,u(P) - u(E) - u(N) - (1 + \tan\alpha)\,u(S) = h^2 f(P) + h \sec\alpha\; g(P^*)$ .
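The elimination of u(W) can be written out step by step. The derivation below is a worked sketch which assumes that the five-point formula at P reads $4u(P) - u(E) - u(W) - u(N) - u(S) = h^2 f(P)$ and that $g(P^*)$ denotes the prescribed normal derivative at the boundary point $P^*$; both are inferred from the surrounding text rather than stated in this passage.

```latex
\begin{align*}
\frac{u(W) - (1-\tan\alpha)\,u(P) - \tan\alpha\,u(S)}{h\sec\alpha} &= g(P^*) \\
\Longrightarrow\quad
u(W) &= (1-\tan\alpha)\,u(P) + \tan\alpha\,u(S) + h\sec\alpha\,g(P^*), \\[4pt]
4u(P) - u(E) - u(N) - u(S) - u(W) &= h^2 f(P) \\
\Longrightarrow\quad
(3+\tan\alpha)\,u(P) - u(E) - u(N) - (1+\tan\alpha)\,u(S)
  &= h^2 f(P) + h\sec\alpha\,g(P^*).
\end{align*}
```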

The second scheme, which we call scheme II.N, is much more elaborate. The normal
derivative at P is approximated as a linear combination of u(W), u(S') and
u(E'). Here S' and E' are respectively the points on the normal through W
intersecting the mesh lines through S and E. We then interpolate u(S') by
a linear combination of u(N), u(P) and u(S) and do a similar interpolation for
u(E'). We then eliminate u(W) from the five-point formula and obtain a six-point
formula for the Laplacian and the boundary condition. See e.g. [6] for
details. We remark that this is not the most compact scheme possible. Bramble
and Hubbard in [7] obtained a positivity scheme using only three mesh points,
while scheme II.N does not even have positivity. It does, however, use values
at points very close to P, and we can obtain a convergence proof using the discrete
potential theory sketched in section 5. See [6] for details.

Finally we remark that as far as the Poisson equation is concerned, we can
first find the values of u at $\partial\Omega_h$ using the integral equation method and then
back solve via the capacitance matrix method. See e.g. [6]. Scheme II.N,
however, has the advantage of being easily extendable to the Helmholtz equation
and to three dimensional problems.

4. NUMERICAL PROCEDURES AND OPERATION COUNTS. In the Dirichlet case, we
use the discrete Green's function on the entire plane as our G in equations
(3.2)-(3.3). It is shown in [3] that

       $G(r,s) = (1/4\pi)\,\log(r^2+s^2) + c_1/(r^2+s^2) + c_2\, r^2 s^2/(r^2+s^2)^3 + R_1$ ,

where $|R_1| \le c_3/(r^2+s^2)^2$, and $c_1$, $c_2$ and $c_3$ are known
positive constants. We can therefore generate G by one call of a fast solver
using distant values of G as the Dirichlet data.

Because  of  translational  invariance,  this  is  all  we  need  to  set  up  the 
capacitance  matrix  C . Equation  (3.2)  is  then  solved  by  the  conjugate  gradient 
method.  We  shall  see  in  section  5 that  because  of  the  favorable  distribution  of 
the  singular  values  of  C,  the  convergence  is  essentially  independent  of  h . 
Typically,  only  four  or  five  iterations  are  needed  to  achieve-  an  accuracy  of  10“  ' . 
The operation counts are therefore approximately $10m^2$. (Note that $C^T C \mu = C^T b$
is actually solved by the conjugate gradient routine.)
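The remark that the conjugate gradient routine works on the normal equations $C^T C x = C^T b$ can be illustrated with a short sketch. The code below is a generic conjugate gradient iteration on the normal equations, not the authors' routine; the test matrix (the identity plus a rank-one term, so that all but a few singular values equal one) and every name in it are assumptions chosen to mimic the clustering of singular values described in the text. For this particular matrix $C^T C$ has at most three distinct eigenvalues, so in exact arithmetic at most three iterations are needed.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cg_normal_equations(C, b, tol=1e-10, max_iter=200):
    """Conjugate gradients applied to C^T C x = C^T b, starting from x = 0.

    Returns the approximate solution and the number of iterations performed.
    """
    Ct = [list(col) for col in zip(*C)]
    x = [0.0] * len(b)
    r = b[:]                      # residual b - C x of the original system
    s = matvec(Ct, r)             # residual C^T r of the normal equations
    d = s[:]                      # search direction
    gamma = dot(s, s)
    gamma0 = gamma
    for k in range(max_iter):
        if gamma <= tol * tol * gamma0:
            return x, k
        q = matvec(C, d)
        alpha = gamma / dot(q, q)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = matvec(Ct, r)
        gamma_new = dot(s, s)
        d = [si + (gamma_new / gamma) * di for si, di in zip(s, d)]
        gamma = gamma_new
    return x, max_iter


# Hypothetical capacitance-like matrix: identity plus a rank-one perturbation.
m = 20
a = [1.0 / (i + 1) for i in range(m)]
C = [[(1.0 if i == j else 0.0) + 0.5 * a[i] for j in range(m)] for i in range(m)]
b = [float(i % 3) - 1.0 for i in range(m)]

x, iterations = cg_normal_equations(C, b)
Cx = matvec(C, x)
residual = max(abs(ci - bi) for ci, bi in zip(Cx, b))
```

Each iteration costs one product with C and one with $C^T$, that is about $2m^2$ operations for a dense m × m matrix, which is consistent with the estimate of roughly $10m^2$ for four or five iterations.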

The capacitance matrix equation can be simplified to $C\mu = U^T v$ if $v = UU^T v$.
This will be the case if we solve the Laplace equation instead of the more general
Poisson equation and if we make the Ansatz $u = GV\mu$. Because $V\mu$ is a sparse vector,
the  boundary  values  of  u on  the  sides  of  a rectangle  containing  can  be  com- 
puted at  a cost  of  constant  iv  . Therefore  we  can  backsolve  for  u by  only  one 
call  of  fast  Poisson  solver.  The  total  operation  counts  in  this  case  is  just  ap- 
proximately 5n2  log  n + SOn"  . If  v is  not  equal  to  UUTv,  we  must  first 
compute $W^T G v$. It is shown in [3] that $W^T G v$, $Gv$ and $GV\mu$ can be computed at
approximately the cost of two calls of fast Poisson solvers. The total operation
counts are estimated to be $10n^2 \log n + 120n^2$.

In the Neumann case, we have used the iterative imbedding technique first
used by J. A. George in [8]. We no longer require G to be translationally
invariant. The vectors Cp are computed as follows. First compute Up. Solve
By = Up. Then compute $p + W^T y$. We note that Up is a sparse vector and the
values of y are only needed in a neighbourhood of $\partial\Omega$. Therefore $W^T y$ can
be computed at a cost of approximately $8n^2$ using the discrete Fourier transform
methods studied in [9] or [10]. The total operation count is therefore
approximately $160n^2$.
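The three steps for forming Cp (extend p by zero, solve with B, then add $W^T y$ to p) amount to applying C without ever storing it. The following sketch shows that matrix-free product on a one-dimensional stand-in; the tridiagonal B solved by the Thomas algorithm takes the place of the fast solver, and the index list `irregular`, the matrix W and the function names are all hypothetical.

```python
def solve_B(rhs):
    """Solve tridiag(-1, 4, -1) y = rhs by the Thomas algorithm (fast-solver stand-in)."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = -1.0 / 4.0
    d[0] = rhs[0] / 4.0
    for i in range(1, n):
        denom = 4.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    y = [0.0] * n
    y[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        y[i] = d[i] - c[i] * y[i + 1]
    return y


def extend_by_zero(p, irregular, n):
    """Form U p: place the entries of p at the irregular points, zero elsewhere."""
    Up = [0.0] * n
    for j, k in enumerate(irregular):
        Up[k] = p[j]
    return Up


def apply_capacitance(p, irregular, W, n):
    """Return C p = p + W^T y, where B y = U p."""
    y = solve_B(extend_by_zero(p, irregular, n))
    return [p[j] + sum(W[k][j] * y[k] for k in range(n)) for j in range(len(p))]


# Hypothetical data: 8 mesh points, 2 irregular points.
n = 8
irregular = [2, 5]
W = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [0.5, 0.0],
     [0.0, 0.5], [0.0, 1.0], [0.0, 0.5], [0.0, 0.0]]

col0 = apply_capacitance([1.0, 0.0], irregular, W, n)   # first column of C
col1 = apply_capacitance([0.0, 1.0], irregular, W, n)   # second column of C
p = [2.0, -1.0]
Cp = apply_capacitance(p, irregular, W, n)

# Check of the inner solve: B y should reproduce U p.
Up = extend_by_zero(p, irregular, n)
y = solve_B(Up)
By = [4.0 * y[i] - (y[i - 1] if i > 0 else 0.0) - (y[i + 1] if i < n - 1 else 0.0)
      for i in range(n)]
```

An iterative method such as conjugate gradients needs only this product (and the analogous one with the transpose), so the cost per iteration is dominated by the solve with B.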


5. DISCRETE POTENTIAL THEORY. In the Dirichlet case, we must show that C
is nonsingular for at least sufficiently small mesh sizes. In the numerical
experiments C is uniformly well conditioned (see also [11]) if the double layer
Ansatz is used. It is, however, much more difficult to prove rigorously that
C must be nonsingular. The proof given in [11] can only be modified to prove the
nonsingularity of C for the single layer case. In the following, we sketch
two proofs, one for special cases and the other for general regions.

DEFINITION. The near diagonal part $B_h$ of C is defined to be the matrix
that satisfies the following: $B_h(P,Q) = C(P,Q)$ if $d(P,Q) \le \sqrt{h}$, and
$B_h(P,Q) = 0$ otherwise.


DEFINITION. The off diagonal part $K_h$ of C is defined by $K_h = C - B_h$.


It is easy to see that the $K_h$ of suitably scaled C are formal approximations
to the compact operator K in equation (2.3). Choose operators $K_m$ from
C → C such that the $K_m$ have the same nonzero eigenvalues as $K_h$. Here C is the
Banach space of continuous functions on [0,1] with the sup norm. It is shown
in [3] that the following holds.

THEOREM 1. $B_h + B_h^T \ge \alpha I$; $\alpha \ge 1$ for scheme I, $\alpha \ge 0.4$ for scheme II.
$\{K_m\}$ is collectively compact on C, and $K_m x \to Kx$ for each $x \in C$.

It can then be shown from the theory of collectively compact operators that, given
$\varepsilon > 0$, $K_h + K_h^T > K + K^T - \varepsilon$ for sufficiently small h.

DEFINITION. $\Omega \in \mathcal{B}_\beta$ if $K + K^T > -\beta I$. Here K is the compact operator
associated with $\Omega$ by the line integral in (2.2).

The foregoing discussion then easily leads to the following result.

THEOREM 2. Let $\Omega \in \mathcal{B}_\beta$, $\beta = 1$ (or 0.4) if scheme I (or II) is used.
Then C is uniformly invertible in the spectral norm.

It is known that $\mathcal{B}_\beta$ is nonempty for any $\beta \ge 0$. All ellipses with
$(a-b)/(a+b) = \beta$ are in $\mathcal{B}_\beta$, where a, b are the half axes of the ellipse.
See e.g. [3].
We next establish that C is nonsingular, with its smallest singular value
larger than 1/[K(A)], where K(A) denotes the condition number of A, for at
least sufficiently small mesh sizes. Let ||C|| denote the maximum norm of a

matrix  C . It  is  shown  in  [3]  that  ||c  ^ ||  <_  1/[K(A)]^  if  there  exists  a scheme 
of  interpolating  boundary  conditions  such  that  the  corresponding  C is  uniformly 
invertible  in  the  maximum  norm.  It  can  be  shown  (see  e.g.  [12])  that  suitably 
chosen $\alpha$'s will make the row sums of $B_h$ one while preserving the diagonal
dominance of $B_h$. If we now form $B_m$ from $B_h$ in the same way as the $K_m$ are
formed from $K_h$, then the following holds.

LEMMA 1. If scheme X is used, then $\|B_m^{-1}\| \le$ constant and $B_m x \to x$
for each $x \in C$.

We can then prove the following theorem very easily using some basic results
from the theory of collectively compact operators. See [12] for details of the
proof. See also [13] for a good account of the theory.

THEOREM 3. Suppose $B_m x \to x$ and $\|B_m^{-1}\| \le$ constant. If $\{K_m\}$ is
collectively compact and $K_m x \to Kx$ for each $x \in C$, then
$\|(B_m + K_m)^{-1}\| \le$ constant iff I + K is invertible.

It is easy to see that the above theorem implies that C for scheme X is
uniformly invertible in the maximum norm. This proves our claim that C is
nonsingular in general. The following theorem guarantees the rapid convergence of
the conjugate gradient iterations.

THEOREM 4. Suppose $C = B_h + K_h$, where $\|B_h\| <$ constant and the
singular values of $K_h$ cluster around those of a compact operator K. If in
addition C is nonsingular with its smallest singular value $\ge h^{p}$, p = positive
constant, then the number of conjugate gradient iterations needed to reduce the
error E(x) to a given accuracy does not exceed constant times log m. Here
$E(x) = (1/2)\,(x - x^*)^T Q\,(x - x^*)$, where $Q = C^T C$ and $x^*$ satisfies $Qx^* = C^T b$.

PROOF.  See  e.g.  [3] . 

A similar theorem also holds for the Neumann case, see e.g. [4]. The theory
for the second order case is much more complicated and we again have to go back
to classical potential theory and use a theorem similar to theorem 3. It is,
however, not necessary to introduce an artificial scheme X. Instead, an artificial
splitting of the matrix U has to be introduced. See e.g. [6] for details.

6. NUMERICAL EXPERIMENTS. We have found that the convergence rates in the
Dirichlet and Neumann cases vary very little when the domains and mesh sizes are
changed. The following table represents a typical situation. The Neumann problem
is solved on ellipses with $b/a = \gamma$, where a, b are the half axes. The test
solution is sin(x+y). The right hand side is therefore $2h^2 \sin(x+y)$ on regular
mesh points and $2\cos(x+y)\,(\pm 1.0 \pm \tan\alpha)\,h + 2h^2 \sin(x+y)$ on irregular mesh
points. In Table I, $E(\nabla u) = \max\{|\delta_x(u_h - u)|, |\delta_y(u_h - u)|\}$, where $\delta_x$ and $\delta_y$ are
respectively the difference quotients in the x and y coordinate directions. The
norm of the C.G. residual is computed by dividing the $L_2$ norm of the conjugate
gradient residuals by the square root of the estimated number of mesh points
inside $\Omega$.


                                  TABLE I

                          $-\Delta u = 2 \sin(x+y)$
                          Scheme I.N  (h = 1/48)

No. of        No. of irregular              Norm of
iterations    mesh points         $\gamma$      C.G. residual     $E(\nabla u)$

    2              108            1.0       .2803274-03       1.0-02
    3              108            1.0       .4790281-04       1.0-02
    2               92            0.7       .3755522-03       2.5-02
    3               92            0.7       .1263798-03       2.0-02
    2               84            0.5       .6496120-03       6.0-02
    3               84            0.5       .3265076-03       4.0-02

We remark that it is very important that the right hand side of the
continuous problem is consistent and that the right hand side of the capacitance
matrix equation should be consistent whenever the same holds for the discrete
problem. The latter is either guaranteed by discrete potential theory if we split
up the matrix U, or it is automatically satisfied if we do not split up U, as in
all practical applications. See e.g. [4] for details. In spite of this, it is,
however, not necessary to check for consistency in the original discrete problem.
In fact, the solution produced by our algorithm is not sensitive to small changes
in the right hand side of the discrete problem.

In the Dirichlet case, we can obtain results of high order accuracy if
scheme H is used. Deferred correction methods or Richardson extrapolation can
be used to obtain sixth order accuracy results. See [5] for details.
Table II is taken from [5]. Similar results of fourth order accuracy
obtained by the conjugate gradient method with operation counts proportional to
$n^2 \log n$ can be found in [12].


                                  TABLE II

                $-\Delta u = 2 \sin(x+y)$,   $u(x,y) = \sin(x+y)$
                                  Scheme H

Correction                       0               1               2

$\varepsilon_\infty$, k = 6      1.9 x 10^-5     1.0 x 10^-9     5.6 x 10^-12
$\varepsilon_2$, k = 6           1.0 x 10^-5     5.4 x 10^-10    3.4 x 10^-12

L2 and maximum-norm errors for Dirichlet problem on a circle
A POWER SERIES SOLUTION OF A HARMONIC MIXED BOUNDARY VALUE PROBLEM


J. Barkley Rosser* and N. Papamichael**


ABSTRACT. The first twenty coefficients for a boundary value problem
are approximated. This gives a useful conformal transformation.


1. Definition of the coefficients.

We consider a rectangle R with corners (±1,0) and (±1,1).
Let u(x,y) be the function continuous on and inside the rectangle such
that the second partial derivatives are continuous inside the rectangle,
the first partial derivatives are continuous on and inside the rectangle
except at the point (0,0),


(1.1)  $\nabla^2 u(x,y) = 0$

inside the rectangle, and

(1.2)  $u(x,0) = 0$,       $-1 < x \le 0$
(1.3)  $u_y(x,0) = 0$,     $0 < x < 1$
(1.4)  $u_y(x,1) = 0$,     $-1 < x < 1$
(1.5)  $u_x(-1,y) = 0$,    $0 < y < 1$
(1.6)  $u(1,y) = 1$,       $0 < y < 1$ .


** Department of Mathematics, Brunel University.

Sponsored in part by the United States Army under Contract
No. DA-31-124-ARO-D-462 and in part by the Science Research Council
under grant B/RG 4121 at Brunel University.

The function 500 + 500 u(x,y) was studied in Problem 2 on p.660
of Whiteman and Papamichael [1]. They cite a considerable number of
papers where functions as closely related to u(x,y) as theirs have
been studied.

Let us put

(1.7)  $z = x + iy$ .

Then u(x,y) is the real part of a function u(z) which is analytic
inside the rectangle.

We will show that there are real constants $a_n$ such that inside R

(1.8)  $u(z) = \sum_{n=0}^{\infty} a_n z^{n+1/2}$ .

Indeed, the series on the right of (1.8) has radius of convergence 2.

Thus, we conclude that

(1.9)  $u(r\cos\theta, r\sin\theta) = \sum_{n=0}^{\infty} a_n r^{n+1/2} \cos(n+\tfrac12)\theta$

for positive values of r and for $0 < \theta < \pi$ such that the point
$(r\cos\theta, r\sin\theta)$ is inside R.

In the present report a procedure for determining the coefficients
$a_n$ in (1.8) is developed and is used to calculate numerical approximations
$\tilde a_n$ to $a_n$, for $0 \le n \le 19$. Approximations $\tilde u(x,y)$ to the solution
u(x,y) of problem (1.1)-(1.6) are then obtained, for various values of
$(x,y) = (r\cos\theta, r\sin\theta)$ in R, from

       $\tilde u(x,y) = \tilde u(r\cos\theta, r\sin\theta) = \sum_{n=0}^{19} \tilde a_n r^{n+1/2} \cos(n+\tfrac12)\theta$ .
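Evaluating the truncated series at a point is a direct translation of the formula. The sketch below is illustrative only: the coefficient values are made up, since no computed coefficients appear in this part of the report, and the function name `u_series` is hypothetical. It also checks two properties that any series of this form has: it vanishes on the negative real axis (where θ = π), and it equals the real part of the sum of $a_n z^{n+1/2}$ taken with the principal branch.

```python
import math


def u_series(a, x, y):
    """Evaluate sum_n a[n] * r**(n + 1/2) * cos((n + 1/2) * theta) at (x, y)."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x)
    return sum(a_n * r ** (n + 0.5) * math.cos((n + 0.5) * theta)
               for n, a_n in enumerate(a))


# Hypothetical coefficients, NOT the values computed in the report.
a_demo = [1.0, -0.25, 0.1]

on_slit = u_series(a_demo, -0.5, 0.0)          # theta = pi, so every term vanishes
at_point = u_series(a_demo, 0.3, 0.4)

# Cross-check against the complex form Re sum a_n z^(n + 1/2).
z = complex(0.3, 0.4)
complex_form = sum(a_n * (z ** (n + 0.5)).real for n, a_n in enumerate(a_demo))
```

Because r < 2 throughout R and the radius of convergence is 2, the terms decay at least geometrically at any fixed interior point, which is what makes a twenty-term truncation usable.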
2. Some proofs.


The  first  order  of  business  is  to  prove  the  various  assertions 
made  in  Section  1.  We  shall  make  repeated  use  of  a "reflection 
principle".  Our  first  step  is  to  "reflect"  u(x,y)  relative  to  the  x-axis 
by  defining 


(2.1)  $v(x,y) = \begin{cases} u(x,y), & 0 \le y \le 1 \\ u(x,-y), & -1 \le y \le 0 . \end{cases}$


It is easily shown (see our later discussion of w) that v is
harmonic inside the square with corners (±1,±1) except along the "slit"
defined by y = 0, $-1 < x \le 0$; along this "slit" it is continuous and
equal to 0, but $v_y(x,y)$ is discontinuous across the "slit". On the right
side of the square, v = 1, and on the other three sides the normal derivative
is zero. As v is merely a continuation of u across the x-axis, we
shall call it u.

By  "reflection"  relative  to  the  lines  y = 1 and  y = -1,  we  can  continue 
u into  the  rectangle  with  corners  (±1,±3);  it  will  be  harmonic  inside  the 
rectangle  except  along  three  "slits".  By  "reflection"  relative  to  the 
line x = -1, we can continue u into the rectangle with corners (-3,±3)
and (1,±3); again, it is harmonic except along three "slits". These
three  "slits"  are  the  three  straight  line  segments  extending  respectively 
from  (-2,2)  to  (0,2),  from  (-2,0)  to  (0,0),  and  from  (-2,-2)  to  (0,-2). 

Finally, we define w(x,y) by

(2.2)  $w(x,y) = \begin{cases} u(x,y), & -3 \le x \le 1 \\ 2 - u(2-x,y), & 1 \le x \le 5 . \end{cases}$

To show that w is harmonic inside the large rectangle with corners
(-3,±3) and (5,±3) except along six "slits", we reason as follows.
Define $\tilde w(x,y)$ to be the function which is harmonic inside this large
rectangle except along the six "slits", is zero on both vertical sides
of the rectangle, has zero normal derivative along the top and bottom
of the rectangle, is -1 along each of the three "slits" on the left,
which go from (-2,2) to (0,2), from (-2,0) to (0,0), and from (-2,-2)
to (0,-2) respectively, and is +1 along each of the three "slits" on
the right, which go from (2,2) to (4,2), from (2,0) to (4,0), and from
(2,-2) to (4,-2) respectively.

To show that there is such a $\tilde w(x,y)$, we appeal to Theorem 1.25
of Tsuji [7]. Indeed, let us define the "Schottky prolongation"
of our rectangle as at the bottom of p.31 of Tsuji [7]. Specifically,
we take a Riemann surface of two sheets, each a rectangle covering
our large rectangle, fastened together along their top and bottom edges.
We take (x,y), $-3 \le x \le 5$, $-3 \le y \le 3$, to denote a point on the top sheet
and $(x,\hat y)$ correspondingly to denote a point on the bottom sheet. Then, as
in the proof of Theorem 1.25, we seek a function $\tilde w$ such that

       $\tilde w(x,\hat y) = \tilde w(x,y)$ .

Accordingly we are reduced to a strictly Dirichlet problem, since the
upper and lower edges of the rectangle are no longer boundaries. Hence,
it suffices to appeal to the existence theorem, Theorem 1.12 on p.8 of
Tsuji [7], for the Dirichlet problem. Similarly, using the same
"Schottky prolongation" of our large rectangle, we can reduce the
uniqueness of $\tilde w(x,y)$ to uniqueness for a strictly Dirichlet case.
Specifically, let $\tilde w(x,y)$ be a solution on the rectangle, whose uniqueness
is to be deduced. Define $\tilde w$ for the Riemann surface by

       $\tilde w(x,\hat y) = \tilde w(x,y)$ .

Then, since $\tilde w(x,y)$ and its derivatives of first order have continuous
limits on the top and bottom edges of the rectangle, the same is
true of $\tilde w(x,y)$ and $\tilde w(x,\hat y)$. Also, $\tilde w(x,y)$ and $\tilde w(x,\hat y)$ are equal and
have equal normal derivatives (namely zero) along the top and bottom
edges of the rectangle. So, by Theorem VI on p.261 of Kellogg [8],
$\tilde w$ is a harmonic function on the Riemann surface. Thus we are reduced
to a strictly Dirichlet case, and uniqueness follows from the generalized
principle of the maximum, Theorem 1.2 on p.2 of Tsuji [7].

There  are  numerous  errors  in  Tsuji  [7].  Most  are  easily  rectified 
by  an  attentive  reader.  We  should  note  that  in  his  proof  of 
Theorem  1.12  (existence),  Tsuji  uses  a stronger  version  of  Theorem  1.2 
(maximum  principle)  than  he  states  or  proves.  In  Theorem  1.2,  the 
statement  of  (i)  should  be  strengthened  to: 

(i)  If $\varphi(z)$ is subharmonic in D and for each $\zeta \in \Gamma$,

       $\lim_{z \to \zeta} \varphi(z) \le M$ ,  $z \in D$ ,

then $\varphi(z) \le M$ in D.

Statements (ii) and (iii) should similarly be strengthened.

To prove the strengthened version of (i), suppose that there is an
$\varepsilon > 0$ such that there are points of D for which $\varphi(z) \ge M + \varepsilon$.
If there are such points, the totality of them can be shown to
constitute a closed region, completely interior to D, because $\varphi(z)$
is continuous in D, since that is part of the definition of
"subharmonic". But then, in this closed region, $\varphi(z)$ must attain
its maximum value, which will be the maximum value of $\varphi(z)$ in D. Then
the rest of the proof proceeds very similarly to the proof given in
Tsuji for the weaker version of (i).

It  happens  that  in  his  proof  of  Theorem  1.12  (existence),  Tsuji  uses 
the  denumerable  axiom  of  choice.  As  this  axiom  seems  indispensable  for 
the  theory  of  Lebesgue  integration,  most  analysts  could  not  reasonably 
object to Tsuji's proof. However, for our present purpose we can dispense
with  the  axiom  of  choice.  For  the  benefit  of  any  purists  among  our  readers 
we  shall  indicate  how  to  do  this. 

If u(x,y) exists, and we define w(x,y) by (2.2), then by Theorem VI
on p.261 of Kellogg [8] w(x,y) is harmonic. If we then define

       $\tilde w(x,y) = w(x,y) - 1$ ,

it will be seen to have the required properties. So we are reduced to
the question of the existence of w(x,y). If we follow this back, we are
reduced finally to the question of the existence of a u(x,y) satisfying
(1.1) through (1.6). But this existence can be concluded by means of the
conformal transformation defined in Whiteman and Papamichael [1].

Thus, for the case at hand, we do not need to invoke Theorem 1.12,
the existence theorem of Tsuji [7]. The maximum principle, Theorem 1.2,
from which we can deduce uniqueness, is proved without using the axiom of
choice.

It will be noted that both $\tilde w(2-x,y)$ and $-\tilde w(x,y)$ are harmonic inside
the large rectangle with corners (-3,±3) and (5,±3) except along the six
"slits", and both satisfy the same boundary conditions around the boundary
of the rectangle, and along each of the "slits". So we conclude that

(2.3)  $\tilde w(2-x,y) = -\tilde w(x,y)$ .
From this, we infer that

(2.4)  $\tilde w(1,y) = 0$ .

So, inside the rectangle with corners (-3,±3) and (1,±3),
u(x,y) and $1 + \tilde w(x,y)$ are both harmonic except along the three "slits",
and both satisfy the same boundary conditions around the boundary of the
rectangle, and along the "slits". So, inside this rectangle we have

(2.5)  $u(x,y) = 1 + \tilde w(x,y)$ .

Then by (2.2) and (2.3), we conclude that

       $w(x,y) = 1 + \tilde w(x,y)$

holds in the large rectangle. So w(x,y) is harmonic, as claimed.

We may think of w as continuing u into the large rectangle.
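The step from (2.5) to the identity on the whole large rectangle can be spelled out for the right half. In the sketch below $\tilde w$ denotes the auxiliary function that equals -1 on the left "slits" and +1 on the right "slits"; the tilde is a notational assumption made here to keep it apart from w. For $1 \le x \le 5$ the reflected point $(2-x,y)$ lies in $-3 \le 2-x \le 1$, where (2.5) applies.

```latex
\begin{align*}
w(x,y) &= 2 - u(2-x,\,y)                    && \text{by (2.2), } 1 \le x \le 5 \\
       &= 2 - \bigl(1 + \tilde w(2-x,\,y)\bigr) && \text{by (2.5) at the point } (2-x,\,y) \\
       &= 1 - \tilde w(2-x,\,y) \\
       &= 1 + \tilde w(x,y)                  && \text{by (2.3).}
\end{align*}
```

For $-3 \le x \le 1$ the identity is (2.5) itself, since there w = u by (2.2).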

We now restrict attention to the circle of radius 2 with center
at the origin. Inside this circle u is harmonic except along the "slit"
defined by y = 0, $-2 < x \le 0$; along this "slit" it is continuous and
equal to 0.

Recall (1.7), in which we took

       $z = x + iy$ .

Then u(x,y) is the real part of a function u(z) which is analytic inside
the circle except along the "slit".

By adding a constant if need be, we can determine that u(0) = 0.

As $u_y(x,y) = 0$ along the positive real axis, we conclude by the
Cauchy-Riemann differential equations that u(z) is real along the positive
real axis. So

(2.6)  $u(\bar z) = \overline{u(z)}$ .

We transform from the z-plane to the $\zeta$-plane by the transformation

(2.7)  $z = \zeta^2$ .

The "slit" goes into the part of the imaginary $\zeta$-axis from $-i\sqrt{2}$ to $i\sqrt{2}$.
The rest of the interior of the circle goes into the interior of a semi-circle
of radius $\sqrt{2}$ with center at the origin lying to the right of the
imaginary $\zeta$-axis.
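The geometry of the transformation (2.7) is easy to confirm numerically. The sketch below uses the principal square root as the inverse map on the right half plane; the sample points and the name `to_zeta` are arbitrary choices for illustration.

```python
import cmath
import math


def to_zeta(z):
    """Principal square root: the inverse of z = zeta**2 with Re(zeta) >= 0."""
    return cmath.sqrt(z)


# Points on the "slit" y = 0, -2 <= x <= 0 (imaginary part +0.0).
slit = [complex(-2.0 * k / 10, 0.0) for k in range(11)]
slit_images = [to_zeta(z) for z in slit]

# Points inside the circle of radius 2 and off the slit.
interior = [cmath.rect(r, t) for r in (0.5, 1.0, 1.9) for t in (-3.0, -1.0, 0.5, 2.5)]
interior_images = [to_zeta(z) for z in interior]

root2 = math.sqrt(2.0)
slit_on_axis = all(abs(w.real) < 1e-12 and abs(w) <= root2 + 1e-12 for w in slit_images)
interior_in_half_disc = all(w.real > 0.0 and abs(w) < root2 for w in interior_images)
round_trip = all(abs(w * w - z) < 1e-12 for w, z in zip(interior_images, interior))
```

Points approached from below the slit map to the lower half of the imaginary segment; the principal branch used here takes the upper side, which is why the sample slit points carry a positive zero imaginary part.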

Inside the semi-circle, we define an analytic function $v(\zeta)$ by

(2.8)  $v(\zeta) = u(\zeta^2)$ .

By (2.6) we have

(2.9)  $v(\bar\zeta) = \overline{v(\zeta)}$ .

Along the imaginary $\zeta$-axis from $-i\sqrt{2}$ to $i\sqrt{2}$, we have by (2.8) that
the real part of $v(\zeta)$ is zero. Then by (2.9) we have

(2.10)  $v(-\zeta) = -v(\zeta)$

along that part of the imaginary axis. We define $v(\zeta)$ by (2.10)
for $\zeta$ in the left half of the circle of radius $\sqrt{2}$ with center at
the origin. It is easily verified that the real part of $v(\zeta)$ is
harmonic inside the entire circle; compare our earlier reasoning
concerning w. By the Cauchy-Riemann differential equations and the
fact that $v(\zeta)$ is continuous across the imaginary axis by (2.10),
we conclude that $v(\zeta)$ is analytic in the entire circle. Thus,
it can be expanded in a power series

(2.11)  $v(\zeta) = \sum_{n=0}^{\infty} b_n \zeta^n$

with radius of convergence at least $\sqrt{2}$.

By (2.9), we conclude that the $b_n$ are all real. By (2.10), $b_n$ is zero
when n is even. Thus, by (2.7) and (2.8) there are real $a_n$ such that

(2.12)  $u(z) = \sum_{n=0}^{\infty} a_n z^{n+1/2}$ .

This is (1.8), and by taking the real part of both sides we conclude
that (1.9) holds, namely

(2.13)  $u(r\cos\theta, r\sin\theta) = \sum_{n=0}^{\infty} a_n r^{n+1/2} \cos(n+\tfrac12)\theta$ .

We see that (2.12) has a radius of convergence at least as large as 2.
However, the radius of convergence cannot be greater than 2. For suppose
the radius of convergence is R > 2. Then (2.13) would be valid for
0 < r < R. Then we could infer that $u_y(x,y)$ is continuous across part
of the "slit" defined by y = 0, $2 \le x \le 4$; the latter is not the case.

So the series (2.12) has radius of convergence 2.

The  purpose  of  the  present  report  is  to  explain  a procedure  for 
calculating  the  a^,  and  to  present  approximations  for  a number  of  them. 

One  can  find  that  series  such  as  (2.12)  are  developed  in  Wasow  [2]. 

It  happens  that  the  series  developed  there  are  only  shown  to  be  asymptotic. 
However,  this  does  not  exclude  the  possibility  that  in  certain  cases, 
such  as  the  one  we  are  considering,  the  series  is  actually  convergent. 
3. Some algorithmic considerations.


In order to calculate the $a_n$, it is desirable to be able to manipulate
power series on a computer. We shall present some algorithms for this
purpose.

Suppose

(3.1)  $F(x) = \sum_{n=1}^{\infty} F_n x^n$ .

Define $F_n^{(j)}$ as the coefficient of $x^{n+j-1}$ in $(F(x))^j$. That is,

(3.2)  $(F(x))^j = \sum_{n=1}^{\infty} F_n^{(j)} x^{n+j-1}$ .

We clearly have

(3.3)  $F_n^{(1)} = F_n$ ,

(3.4)  $F_n^{(j+1)} = \sum_{r=1}^{n} F_r F_{n+1-r}^{(j)}$ .

If we know $F_n$ for $1 \le n \le N$, then $F_n^{(j)}$ can be calculated for as large a
j as desired and for $1 \le n \le N$ by means of (3.3) and (3.4).

If only the values of $F_n^{(J)}$ for $1 \le n \le N$ are desired, one can get them
without calculating $F_n^{(j)}$ for 1 < j < J by the following stratagem of
Nijenhuis and Wilf [9]. By (3.2),

       $J \log F(x) = \log \sum_{n=1}^{\infty} F_n^{(J)} x^{n+J-1}$ .

If we differentiate both sides, and clear fractions, we get

       $J \sum_{n=1}^{\infty} F_n^{(J)} x^{n+J-1} \sum_{k=1}^{\infty} k F_k x^{k-1} = \sum_{n=1}^{\infty} (n+J-1) F_n^{(J)} x^{n+J-2} \sum_{k=1}^{\infty} F_k x^k$ .

If we equate the coefficients of $x^{M+J}$ on both sides we get

       $J \sum_{n=1}^{M+1} F_n^{(J)} (M+2-n) F_{M+2-n} = \sum_{n=1}^{M+1} (n+J-1) F_n^{(J)} F_{M+2-n}$ .

So, for $M \ge 1$, we have

       $M F_{M+1}^{(J)} F_1 = \sum_{n=1}^{M} \bigl(J(M+1-n) - n + 1\bigr) F_n^{(J)} F_{M+2-n}$ .

If $F_1 \ne 0$, this can be used to calculate $F_n^{(J)}$ recursively,
beginning with

       $F_1^{(J)} = (F_1)^J$ .

Note that J need not be an integer.
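Both ways of obtaining the coefficients of a power of a series can be compared in a few lines. The sketch below implements the repeated convolution (3.3)-(3.4) and the one-pass recurrence obtained from the logarithmic derivative; lists are zero-based, so `F[n-1]` holds $F_n$, and the function names and sample coefficients are illustrative assumptions.

```python
def power_by_convolution(F, j):
    """Coefficients F_n^{(j)}, n = 1..N, by repeated use of (3.4); j a positive integer."""
    N = len(F)
    cur = list(F)                                   # (3.3): F_n^{(1)} = F_n
    for _ in range(j - 1):
        cur = [sum(F[r - 1] * cur[n - r] for r in range(1, n + 1))
               for n in range(1, N + 1)]
    return cur


def power_by_recurrence(F, J):
    """Coefficients F_n^{(J)}, n = 1..N, from the one-pass recurrence; J may be non-integer.

    Requires F_1 != 0 (and F_1 > 0 when J is not an integer).
    """
    N = len(F)
    P = [F[0] ** J]                                 # F_1^{(J)} = (F_1)^J
    for M in range(1, N):
        s = sum((J * (M + 1 - n) - n + 1) * P[n - 1] * F[M + 1 - n]
                for n in range(1, M + 1))
        P.append(s / (M * F[0]))                    # this is F_{M+1}^{(J)}
    return P


F = [4.0, 1.0, 3.0, -1.0, 0.5]                      # hypothetical F_1..F_5

cube_a = power_by_convolution(F, 3)
cube_b = power_by_recurrence(F, 3)

# Non-integer exponent: the "square" of the J = 1/2 result should give back F.
half = power_by_recurrence(F, 0.5)
half_squared = [sum(half[r - 1] * half[n - r] for r in range(1, n + 1))
                for n in range(1, len(F) + 1)]
```

The convolution route costs on the order of $jN^2$ operations while the recurrence costs on the order of $N^2$ regardless of J, which is the advantage the text attributes to the stratagem.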

Clearly, if

(3.5)  $G(x) = \sum_{n=1}^{\infty} G_n x^{n-1}$ ,

then

       $xG(x) = \sum_{n=1}^{\infty} G_n x^n$ .

So, if we define $G_n^{(j)}$ by

(3.6)  $G_n^{(1)} = G_n$ ,

(3.7)  $G_n^{(j+1)} = \sum_{r=1}^{n} G_r G_{n+1-r}^{(j)}$ ,

we will have

       $(xG(x))^j = \sum_{n=1}^{\infty} G_n^{(j)} x^{n+j-1}$ ,
so that


where F(x) is given by (3.1) with

(3.14)  $F_n = B_n / B_0$ ,   $1 \le n$ .


Then, if we define $C_n$ by

(3.15)  $\frac{1}{B(x)} = \frac{1}{B_0} \Bigl[ 1 + \sum_{n=1}^{\infty} C_n x^n \Bigr]$ ,

we have

(3.16)  $1 = (1 + F(x)) \Bigl( 1 + \sum_{n=1}^{\infty} C_n x^n \Bigr)$ .

Equating the coefficients of $x^{M+1}$ on both sides gives the recursion

(3.17)  $C_{M+1} = -F_{M+1} - \sum_{n=1}^{M} C_n F_{M+1-n}$ ,   $1 \le M$ .

From this we have $C_1 = -F_1$. If $C_1, C_2, \ldots, C_M$ have been calculated and
stored in the computer then, by (3.17), $C_{M+1}$ can be calculated and stored
in the computer.
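The recursion (3.17) translates directly into a loop. The sketch below is a minimal illustration with zero-based lists (`F[n-1]` holds $F_n$); the helper `multiply_truncated` and the sample coefficients are assumptions added only so that the result can be checked by multiplying back as in (3.16).

```python
def reciprocal_coefficients(F):
    """Given F_1..F_N, return C_1..C_N with (1 + F(x)) (1 + sum C_n x^n) = 1 + O(x^(N+1))."""
    C = []
    for M in range(len(F)):                         # this pass computes C_{M+1}
        C.append(-F[M] - sum(C[n - 1] * F[M - n] for n in range(1, M + 1)))
    return C


def multiply_truncated(a, b):
    """Product of two series given by coefficient lists (constant term first), truncated."""
    N = len(a)
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(N)]


F = [0.5, -0.25, 0.125, 2.0, -1.0]                  # hypothetical F_1..F_5
C = reciprocal_coefficients(F)
product = multiply_truncated([1.0] + F, [1.0] + C)  # should be 1, 0, 0, ...

geometric = reciprocal_coefficients([1.0, 0.0, 0.0, 0.0])   # 1/(1 + x)
```

The first pass (M = 0) has an empty sum and so yields $C_1 = -F_1$, matching the starting value given in the text.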

Consider next the question of a power series expansion for $(B(x))^{1/J}$,
where J is greater than unity and need not be an integer. By (3.13), we
have

(3.18)  $(B(x))^{1/J} = B_0^{1/J} (1 + F(x))^{1/J}$ .

We define $D_n$ by

(3.19)  $1 + \sum_{n=1}^{\infty} D_n x^n = (1 + F(x))^{1/J}$ .

Then, by the stratagem of Nijenhuis and Wilf [9], we have

(3.20)  $\Bigl( 1 + \sum_{n=1}^{\infty} D_n x^n \Bigr) \sum_{k=1}^{\infty} k F_k x^{k-1} = J \sum_{n=1}^{\infty} n D_n x^{n-1} \Bigl( 1 + \sum_{k=1}^{\infty} F_k x^k \Bigr)$ .

If we equate the coefficients of $x^M$ on both sides of (3.20), we
get for $M \ge 0$

(3.21)  $(M+1) F_{M+1} + \sum_{n=1}^{M} D_n (M+1-n) F_{M+1-n} = J \Bigl[ (M+1) D_{M+1} + \sum_{n=1}^{M} n D_n F_{M+1-n} \Bigr]$ .

So for $0 \le M$

(3.22)  $D_{M+1} = \frac{1}{J(M+1)} \Bigl[ (M+1) F_{M+1} + \sum_{n=1}^{M} (M+1-n-Jn)\, D_n F_{M+1-n} \Bigr]$ .

From this, we have

(3.23)  $D_1 = F_1 / J$ .

If $D_1, D_2, \ldots, D_M$ have been calculated and stored in the computer,
then by (3.22) $D_{M+1}$ can be calculated and stored in the computer.
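A short sketch of the recursion for the coefficients of $(1 + F(x))^{1/J}$ follows. It solves the coefficient identity for $D_{M+1}$ exactly as in the step from (3.21) to (3.22); lists are zero-based (`F[n-1]` holds $F_n$), and the names and sample data are illustrative assumptions. The check uses the known binomial expansion of $(1+x)^{1/2}$ and, for J = 3, cubes the result again.

```python
def root_coefficients(F, J):
    """Given F_1..F_N, return D_1..D_N with 1 + sum D_n x^n = (1 + F(x))^(1/J) + O(x^(N+1))."""
    D = []
    for M in range(len(F)):                         # this pass computes D_{M+1}
        s = (M + 1) * F[M] + sum((M + 1 - n - J * n) * D[n - 1] * F[M - n]
                                 for n in range(1, M + 1))
        D.append(s / (J * (M + 1)))
    return D


def multiply_truncated(a, b):
    """Product of two series given by coefficient lists (constant term first), truncated."""
    N = len(a)
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(N)]


# (1 + x)^(1/2) = 1 + x/2 - x^2/8 + x^3/16 - 5 x^4/128 + ...
sqrt_series = root_coefficients([1.0, 0.0, 0.0, 0.0], 2)

# Cube root of a hypothetical series, then cubed again as a consistency check.
F = [0.3, -0.2, 0.5, 0.1]
root = [1.0] + root_coefficients(F, 3)
cubed = multiply_truncated(multiply_truncated(root, root), root)
```

The first pass (M = 0) gives $D_1 = F_1/J$, which is (3.23).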

It  will  turn  out  eventually  that  we  shall  have  calculated  the 
first  M coefficients  of 

(B(x))!  = BJ  ( 1 + l D xn) 

0 n-1  n 

by  (3.22)  and  (3.23),  and  shall  have  calculated  a number  of  coefficients 
E-,»  E2,  ....  H^,  and  shall  wish  to  solve  for  A^ , Ag,  ...,  Aj^  from  (3.10) 
and  the  identity 

(3.21*)  l En(u(z))2n_1  = (zB(z))J. 

n=1 
By (3.10) and (3.11) we have


(3.25) 


:1  l 

n-  1 


A z 
n 


n-1 


= B ( 1 
o 


l Dr 


n , 
z ) 


n=  1 


A(2r-1)  z m-2  + r 
m 


Clearly 

(3.26  ) 


Suppose  that  A^,  A0,...,A^  have  been  calculated  and  stored  in  the 

computer.  Let  us  also  calculate  and  store  A^  > a!,^  a^  for 

1 < j s 2N  + 1,  using  the  analogues  of  (3.3)  and  (3.4).  Then  by  (3.25) 
we  have 


(3.27) 


\ +1 


N +1 

l E 

r =2 


(2  r-1) 
r AN  +2-r 


Thus  ^n  + 1 can  be  calculated  and  stored. 
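The way the unknowns $A_n$ are peeled off one at a time can be sketched as follows. The sketch is written under stated assumptions: $P(z) = \sum A_n z^{n-1}$, the identity to be solved is $\sum_r E_r z^{r-1} P(z)^{2r-1} = \sum_N c_N z^N$ with the right-hand coefficients $c_N$ supplied by the caller, and $A^{(j)}_m$ is the coefficient of $z^{m-1}$ in $P^j$. The helper names `mul` and `solve_A` are hypothetical, and the powers of $P$ are simply recomputed at each step rather than stored.

```python
def mul(p, q, n):
    """Product of two truncated power series (coefficient lists), kept to n terms."""
    out = [0.0] * n
    for i, a in enumerate(p[:n]):
        for j, b in enumerate(q[: n - i]):
            out[i + j] += a * b
    return out

def solve_A(E, c, count):
    """Solve sum_r E_r z^(r-1) P(z)^(2r-1) = sum_N c_N z^N for P = sum A_n z^(n-1).

    E[r-1] holds E_r (missing entries count as zero); c[N] is the coefficient of z^N.
    """
    A = [c[0] / E[0]]                           # constant term fixes A_1
    for N in range(1, count):                   # computes A_{N+1}
        P = A + [0.0]                           # A_{N+1} is not needed in the sum
        P2 = mul(P, P, N + 1)
        power = P[:]                            # P^1
        total = 0.0
        for r in range(2, N + 2):
            power = mul(power, P2, N + 1)       # now P^(2r-1)
            if r <= len(E):
                total += E[r - 1] * power[N + 1 - r]
        A.append((c[N] - total) / E[0])
    return A

# Example: E_1 = E_2 = 1 and P = 1 + z/2, so the right side is P + z P^3.
A = solve_A([1.0, 1.0], [1.0, 1.5, 1.5, 0.75, 0.125], 5)
```

The subtraction in the last line of the loop is where cancellation enters when the result is small compared with the terms subtracted.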

4.  A power  series  expansion  for  a Jacobi  elliptic  function. 

We will be referring for the next few paragraphs to the NBS Handbook [3], and shall simply cite their reference numbers.

As in 17.2.1, we take $m$ to denote the parameter. As in 17.2.18, we define the complementary parameter to be $m_1$, given by

(4.1)  $m_1 = 1 - m.$


As in 17.3.1 and 17.3.5, the complete elliptic integrals are given by

(4.2)  $K(m) = \int_0^{\pi/2} \frac{d\theta}{(1 - m \sin^2\theta)^{1/2}}$

(4.3)  $K'(m) = K(m_1) = \int_0^{\pi/2} \frac{d\theta}{(1 - m_1 \sin^2\theta)^{1/2}}.$

As in 17.3.17 and 17.3.18, we define the nome $q(m)$ and the complementary nome $q_1(m)$ by

(4.4)  $q(m) = \exp(-\pi K'(m)/K(m))$

(4.5)  $q_1(m) = q(m_1) = \exp(-\pi K(m)/K'(m)).$


We observe that $K(m)$ is an increasing function of $m$. Hence $m$ is a function of $K$, and indeed an increasing one. $K'(m)$ is a decreasing function of $m$, and conversely. We infer that $q(m)$ is an increasing function of $m$, and conversely, while $q_1(m)$ is a decreasing function of $m$, and conversely. As an increasing function of an increasing function is an increasing function, we infer that $K$ is an increasing function of $q$. Indeed, by 17.3.22

(4.6)  $\frac{2K}{\pi} = 1 + 4 \sum_{n=1}^{\infty} \frac{q^n}{1 + q^{2n}}.$

For purposes of computation, a much better series is the one given by 16.38.5

(4.7)  $\left(\frac{2K}{\pi}\right)^{1/2} = 1 + 2 \sum_{n=1}^{\infty} q^{n^2}.$

We define

(4.8)  $R(q) = \frac{2\pi}{K\sqrt{m}}.$
By 16.38.7

(4.9)  $(R(q))^{-1/2} = q^{1/4} \sum_{n=0}^{\infty} q^{n(n+1)}.$

We define

(4.10)  $S(q,v) = \sum_{n=0}^{\infty} \frac{q^{n+1/2}}{1 - q^{2n+1}} \sin (2n+1)v.$

Then, according to 16.23.1, the Jacobi elliptic function sn(u|m) is given by

(4.11)  $\mathrm{sn}(u|m) = R(q)\, S\!\left(q, \frac{\pi u}{2K}\right).$

We observe that

$i\,S(q,-iv) = \sum_{n=0}^{\infty} \frac{q^{n+1/2}}{1 - q^{2n+1}} \sinh (2n+1)v.$

Clearly the right side of the equation above converges for $0 \le v < -(\ln q)/2$. For $0 \le v < -(\ln q)/2$ let us expand $\sinh (2n+1)v$ as a power series in $v$ for each value of $n$. This expresses $i\,S(q,-iv)$ as a multiple series, each term of which is non-negative. So the multiple series is absolutely convergent. Thus, the terms can be rearranged at will. So we get


$i\,S(q,-iv) = \sum_{r=0}^{\infty} A_r(q)\, v^{2r+1},$

where we define

(4.12)  $A_r(q) = \frac{1}{(2r+1)!} \sum_{n=0}^{\infty} \frac{(2n+1)^{2r+1}\, q^{n+1/2}}{1 - q^{2n+1}}.$
As the power series shown for $i\,S(q,-iv)$ has all positive coefficients, and converges for $0 \le v < -(\ln q)/2$, it converges absolutely for $0 \le |v| < -(\ln q)/2$.

Thus we infer

(4.13)  $S(q,v) = \sum_{r=0}^{\infty} (-1)^r A_r(q)\, v^{2r+1}.$

Then by (4.11), we have

(4.14)  $\mathrm{sn}(u|m) = R(q) \sum_{r=0}^{\infty} (-1)^r A_r(q) \left(\frac{\pi u}{2K}\right)^{2r+1}.$
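The nome series above can be evaluated directly. The following Python sketch computes the first few coefficients of sn(u|m) from a given nome $q$, taking $(2K/\pi)^{1/2}$ from (4.7), $R(q)$ from (4.9), and $A_r(q)$ from (4.12). The function name `sn_coeffs_from_nome` and the truncation parameter `terms` are assumptions made for illustration. The example uses $q = e^{-\pi}$, the nome belonging to $m = 1/2$, for which the leading coefficients are 1, 1/4 and 11/160.

```python
import math

def sn_coeffs_from_nome(q, count, terms=60):
    """Coefficients g_r in sn(u|m) = sum (-1)^r g_r u^(2r+1), via the nome series."""
    theta3 = 1.0 + 2.0 * sum(q ** (n * n) for n in range(1, terms))
    two_K_over_pi = theta3 ** 2                                   # by (4.7)
    R = (q ** 0.25 * sum(q ** (n * (n + 1)) for n in range(terms))) ** -2   # by (4.9)
    g = []
    for r in range(count):
        A_r = sum((2 * n + 1) ** (2 * r + 1) * q ** (n + 0.5) / (1.0 - q ** (2 * n + 1))
                  for n in range(terms)) / math.factorial(2 * r + 1)        # (4.12)
        g.append(R * A_r / two_K_over_pi ** (2 * r + 1))          # factor (pi/(2K))^(2r+1)
    return g

q = math.exp(-math.pi)        # nome for m = 1/2, where K' = K
g = sn_coeffs_from_nome(q, 3)
```

Because $q$ is small here, a handful of terms already gives full double precision accuracy; the generous default for `terms` is only a convenience.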


Professor Alan Talbot has called to our attention a much better way to calculate the coefficients of sn(u|m). By 17.2.2 and 17.2.7 of the N.B.S. Handbook [3], we have

$u = \int_0^{\mathrm{sn}(u)} \{(1 - t^2)(1 - mt^2)\}^{-1/2}\, dt.$

Differentiating with respect to $u$ gives

(4.15)  $\left\{\frac{d}{du}\,\mathrm{sn}(u)\right\}^2 = (1 - (\mathrm{sn}(u))^2)(1 - m(\mathrm{sn}(u))^2).$

Differentiating again gives

(4.16)  $\frac{d^2}{du^2}\,\mathrm{sn}(u) = 2m(\mathrm{sn}(u))^3 - (1+m)\,\mathrm{sn}(u).$

By (4.12) and (4.14) we have

(4.17)  $\mathrm{sn}(u|m) = \sum_{r=0}^{\infty} (-1)^r g_r u^{2r+1},$
where the $g_r$ are positive real numbers. Taking $u = 0$ in (4.15) gives

(4.18)  $g_0 = 1.$

If we define

(4.19)  $F_{2r} = 0,$

(4.20)  $F_{2r+1} = (-1)^r g_r,$

then by (3.1) we have

(4.21)  $\mathrm{sn}(u|m) = F(u).$

So by (3.2)

(4.22)  $(\mathrm{sn}(u|m))^3 = \sum_{n=1}^{\infty} F_n^{(3)} u^{n+2}.$

We have

(4.23)  $F_1^{(3)} = 1,$

and by the stratagem of Nijenhuis and Wilf [9]

(4.24)  $M F_{M+1}^{(3)} = \sum_{n=1}^{M} (3M + 4 - 4n)\, F_n^{(3)} F_{M+2-n}.$

Let us define $G_r$ by

(4.25)  $(\mathrm{sn}(u|m))^3 = \sum_{r=0}^{\infty} (-1)^r G_r u^{2r+3}.$

Then by (4.24)

(4.26)  $(M+1)\, G_{M+1} = \sum_{r=0}^{M} (3M + 3 - 4r)\, G_r\, g_{M+1-r}.$
We have of course, by (4.18)

(4.27)  $G_0 = 1.$

Comparing coefficients of $u$ in (4.16) gives

(4.28)  $g_1 = \frac{1+m}{6}.$

Substituting (4.17) and (4.25) into (4.16) gives

(4.29)  $(2M+5)(2M+4)\, g_{M+2} = 2m\, G_M + (1+m)\, g_{M+1}.$

If we have calculated and stored $g_0, g_1, \ldots, g_{M+1}$, then by (4.26) and (4.27) we can calculate and store $G_0, G_1, \ldots, G_{M+1}$. Then by (4.29) we can calculate and store $g_{M+2}$. Hence, starting with $g_0$ and $g_1$ from (4.18) and (4.28), we can calculate $g_N$ for as large $N$ as desired.
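The alternation between the two recurrences can be made concrete with a short Python sketch. It uses the relations $(2M+5)(2M+4)g_{M+2} = 2mG_M + (1+m)g_{M+1}$ and $(M+1)G_{M+1} = \sum_{r=0}^{M}(3M+3-4r)G_r g_{M+1-r}$, with the starting values $g_0 = G_0 = 1$ and $g_1 = (1+m)/6$. The function name `sn_series` is hypothetical, and exact fractions are an illustrative choice that keeps the coefficients rational when $m$ is rational.

```python
from fractions import Fraction

def sn_series(m, count):
    """Return (g, G) with g = [g_0, ..., g_{count-1}] and G = [G_0, ..., G_{count-2}].

    Requires count >= 2.
    """
    g = [Fraction(1), (1 + m) / 6]          # g_0 and g_1
    G = [Fraction(1)]                       # G_0
    while len(g) < count:
        M = len(G) - 1
        # g_{M+2} from G_M and g_{M+1}
        g.append((2 * m * G[M] + (1 + m) * g[M + 1]) / ((2 * M + 5) * (2 * M + 4)))
        # G_{M+1} from G_0..G_M and g_1..g_{M+1}
        G.append(sum((3 * M + 3 - 4 * r) * G[r] * g[M + 1 - r]
                     for r in range(M + 1)) / (M + 1))
    return g, G

g, G = sn_series(Fraction(1, 2), 6)
scaled = [4 ** r * g[r] for r in range(6)]   # coefficients of sn(2v|1/2) / 2
```

For $m = 1/2$ the scaled values come out as 1, 1, 11/10, 13/10, 181/120, 351/200.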


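We note

$g_1 = \frac{1+m}{3!}, \qquad G_1 = \frac{3+3m}{3!},$

$g_2 = \frac{1 + 14m + m^2}{5!}, \qquad G_2 = \frac{13 + 62m + 13m^2}{5!},$

$g_3 = \frac{1 + 135m + 135m^2 + m^3}{7!}, \qquad G_3 = \frac{205 + 3315m + 3315m^2 + 205m^3}{3 \cdot 7!},$

$g_4 = \frac{1 + 1228m + 5478m^2 + 1228m^3 + m^4}{9!},$

$g_5 = \frac{1 + 11069m + 165826m^2 + 165826m^3 + 11069m^4 + m^5}{11!}.$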


The values of $g_0$, $g_1$, $g_2$ and $g_3$ are given in 16.22.1 of the NBS Handbook [3], followed by the comment: "No formulae are known for the general coefficient in these series". However, given a numerical value of $m$, one can quickly calculate as many as might be desired of the $g_r$ by (4.18), (4.28), (4.27), (4.26) and (4.29).

In the present report, we make a check on our calculation of the coefficients of sn(u|m) by carrying out the calculation both by (4.14) and by (4.18), (4.26), (4.27), (4.28) and (4.29).

5.  A succession  of  conformal  transformations. 

We shall follow the procedure outlined in Whiteman and Papamichael [1].

Figure 1. The z-plane.

In Fig. 1 we depict the rectangle of Section 1. We are using a single complex variable $z$, so that each point in the plane is designated by a single complex number. Thus the corners of the rectangle have the designations $(\pm 1)$ and $(\pm 1 + i)$. The points A, B, C, and D are located at $-1$, $0$, $1$, and $1 + i$. We wish an analytic function $u(z)$ whose real part has the value zero on the line segment from A to B and the value unity on the line segment from C to D, and has a zero normal derivative on the line segment from B to C and from D around to A along the top and left sides of the rectangle.

We will determine $u(z)$ by defining a succession of analytic transformations. These will entail conformal transformations of the rectangle into a succession of regions.

We first take

(5.1)  $m = \tfrac{1}{2},$

and write

(5.2)  $z_2 = Kz,$

where $K = K(m)$ is defined by (4.2). With $m = \tfrac{1}{2}$ we have $K'(m) = K(m)$ by (4.3). The rectangle is mapped onto a larger rectangle, with corners at $(\pm K)$ and $(\pm K + iK')$. The points A, B, C, D go into points $A_2 = -K$, $B_2 = 0$, $C_2 = K$, and $D_2 = K + iK' = K + iK$.

We now take

(5.3)  $z_3 = \mathrm{sn}(z_2) = \mathrm{sn}(z_2|m).$

The rectangle of the $z_2$-plane goes into the upper half of the $z_3$-plane. The points $A_2$, $B_2$, $C_2$, and $D_2$ go into points $A_3 = -1$, $B_3 = 0$, $C_3 = 1$, and $D_3 = \sqrt{2}$.


Next we put

(5.4)  $z_4 = \frac{2z_3}{1 + z_3} = 2 - \frac{2}{1 + z_3}.$

The upper half of the $z_3$-plane goes into the upper half of the $z_4$-plane. The points $A_3$, $B_3$, $C_3$, and $D_3$ go into points $A_4 = \infty$, $B_4 = 0$, $C_4 = 1$, and $D_4 = 2\sqrt{2}/(1 + \sqrt{2})$.

Next we put

(5.5)  $z_5 = \sqrt{z_4}.$

We take the determination of the square root so that the upper half of the $z_4$-plane goes into the first quadrant of the $z_5$-plane. The points $A_4$, $B_4$, $C_4$, and $D_4$ go into points $A_5 = \infty$, $B_5 = 0$, $C_5 = 1$, and $D_5 = \sqrt{4/(2 + \sqrt{2})}$.


We take

(5.6)  $m = \frac{2 + \sqrt{2}}{4},$

(5.7)  $z_6 = \mathrm{sn}^{-1}(z_5|m),$

that is

(5.8)  $z_5 = \mathrm{sn}(z_6|m).$

We write

(5.9)  $K = K(m),$

(5.10)  $K' = K'(m) = K(1 - m);$

see (4.2) and (4.3). Then the first quadrant of the $z_5$-plane goes into a rectangle in the $z_6$-plane with corners at $0$, $K$, $iK'$, and $K + iK'$. The points $A_5$, $B_5$, $C_5$, and $D_5$ go into points $A_6 = iK'$, $B_6 = 0$, $C_6 = K$, and $D_6 = K + iK'$.


Finally, we take

(5.11)  $z_7 = z_6 / K.$

The rectangle of the $z_6$-plane goes into a rectangle in the $z_7$-plane with corners at $0$, $1$, $i(K'/K)$, and $1 + i(K'/K)$. The points $A_6$, $B_6$, $C_6$, and $D_6$ go into points $A_7 = i(K'/K)$, $B_7 = 0$, $C_7 = 1$, and $D_7 = 1 + i(K'/K)$.

For $r = 2, 3, \ldots, 7$, we have $z_r$ an analytic function of $z$. When $z$ is on the line segment from A to B, $z_7$ is on the line segment from $A_7$ to $B_7$, and has real part 0. Proceeding similarly around the rectangle of Fig. 1, we verify that $z_7$ is the function $u(z)$ that we were seeking. So we proceed successively to determine $z_3$, $z_4$, $z_5$, and $z_7$ as power series in $z$.


6. Determination of coefficients.

By (5.2), (5.3), and (4.14), we have

(6.1)  $z_3 = R(q) \sum_{r=0}^{\infty} (-1)^r A_r(q) \left(\frac{\pi z}{2}\right)^{2r+1}.$

As $m = \tfrac{1}{2}$ by (5.1), we have $K = K'$ by (4.3). So by (4.4)

(6.2)  $q = e^{-\pi} \cong 0.043213\ 91826\ 37722\ 49774.$
UNCLASSIFIED


ARO-77-3 


Then by (4.12)

(6.3)  $A_0(q)/\sqrt{q} = 1.1847\ 53167\ 61544\ 54453.$

Similar high accuracy approximations were obtained for $A_r(q)/\sqrt{q}$ for $r = 1, 2, \ldots, 9$.

By (4.9)

(6.4)  $R(q)\sqrt{q} = 0.99627\ 55376\ 22549\ 34254.$

Therefore, we had high accuracy approximations for the first 10 non-vanishing coefficients in the expansion of (6.1).

Because $m = \tfrac{1}{2}$ the coefficients in the expansion of sn(u|m) are rational; see the discussion above commencing with formula (4.15). To cancel the powers of $\tfrac{1}{2}$ resulting from the powers of $m$, we define $h_r$ and $H_r$ by

(6.5)  $\mathrm{sn}(2v|\tfrac{1}{2}) = \sum_{r=0}^{\infty} (-1)^r h_r v^{2r+1}$

(6.6)  $(\mathrm{sn}(2v|\tfrac{1}{2}))^3 = \sum_{r=0}^{\infty} (-1)^r H_r v^{2r+3}.$

Then by (5.2) and (5.3), if we define

(6.7)  $v = \frac{Kz}{2},$

we get

(6.8)  $z_3 = \sum_{r=0}^{\infty} (-1)^r h_r v^{2r+1}.$
We record the approximation

(6.9)  $\frac{K}{2} \cong 0.92703\ 73386\ 50685\ 95921\ 49.$

By (4.18), (4.28) and (4.27), we have

(6.10)  $h_0 = 2$

(6.11)  $h_1 = 2$

(6.12)  $H_0 = 8.$

By (4.26), we get

(6.13)  $2(M+1)\, H_{M+1} = \sum_{r=0}^{M} (3M + 3 - 4r)\, H_r\, h_{M+1-r},$

and by (4.29)

(6.14)  $(2M+5)(2M+4)\, h_{M+2} = 4H_M + 6h_{M+1}.$


Thus we get

(6.15)  $z_3 = 2\left[v - v^3 + \frac{11}{10}v^5 - \frac{13}{10}v^7 + \frac{181}{120}v^9 - \frac{351}{200}v^{11} + \frac{31861}{15600}v^{13} - \frac{185363}{78000}v^{15} + \frac{9777931}{3536000}v^{17} - \frac{2625613}{816000}v^{19} + \cdots\right]$

and
Then by (5.5)

(6.18)  $z_5 = 2\sqrt{v}\,\Bigl[1 - v + v^2 - v^3 + \frac{21}{20}v^4 - \frac{23}{20}v^5 + \frac{25}{20}v^6 - \frac{27}{20}v^7 + \frac{3487}{2400}v^8 - \frac{1253}{800}v^9 + \frac{811}{480}v^{10} - \frac{175}{96}v^{11} + \frac{409023}{208000}v^{12} - \frac{1323631}{624000}v^{13} + \frac{285557}{124800}v^{14} - \frac{102677}{41600}v^{15} + \frac{1355681603}{509184000}v^{16} - \frac{812434903}{282880000}v^{17} + \frac{40448213}{13056000}v^{18} - \frac{1701635987}{509184000}v^{19} + \cdots\Bigr].$
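The passage from the series for $z_3$ to the series for $z_5$ can be reproduced in exact rational arithmetic. The sketch below assumes $z_3 = 2v\,s(v)$ with the ten coefficients of (6.15) entered by hand, forms $z_4/(4v) = s/(1 + 2vs)$ from $z_4 = 2z_3/(1+z_3)$, and then takes the square root term by term, so that $z_5 = 2\sqrt{v}\sum t_n v^n$. The variable names are hypothetical, and the direct convolution recurrences stand in for the general algorithms of Section 3.

```python
from fractions import Fraction as Fr

# Coefficients of v^0, v^2, v^4, ... in z_3 / (2v), copied from (6.15).
s_even = [Fr(1), Fr(-1), Fr(11, 10), Fr(-13, 10), Fr(181, 120), Fr(-351, 200),
          Fr(31861, 15600), Fr(-185363, 78000), Fr(9777931, 3536000),
          Fr(-2625613, 816000)]
N = 20
s = [Fr(0)] * N
for r, coeff in enumerate(s_even):
    s[2 * r] = coeff

# a = s / (1 + 2 v s): coefficients of z_4 / (4v).
a = []
for n in range(N):
    a.append(s[n] - 2 * sum(s[k] * a[n - 1 - k] for k in range(n)))

# t = sqrt(a) with t_0 = 1: coefficients of z_5 / (2 sqrt(v)).
t = [Fr(1)]
for n in range(1, N):
    t.append((a[n] - sum(t[k] * t[n - k] for k in range(1, n))) / 2)
```

The list `t` holds the signed coefficients of $v^0$ through $v^{19}$; since only even powers of $v$ up to $v^{18}$ enter `s`, all twenty entries are determined by the data of (6.15).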


Our ability to get rational coefficients, as above, depends on $m$ being rational. If we had started with a rectangle of different proportions this would be quite unlikely. Of course we can always get decimal (or binary) approximations for the coefficients of $z$ in the expansion of $z_3$. This can be done either by (6.1), or by the formulae (4.18), (4.28), (4.27), (4.26) and (4.29) with a numerical approximation for each of $m$ and $K$. Then, by the algorithm embodied in (3.15) and (3.17), we can get approximations for the coefficients of the powers of $z$ in the expansion of $z_4$. However, in a formula like (3.17), there is considerable danger of cancellation errors. To test this out, approximations for the coefficients of powers of $z$ in $z_4$ were calculated from the approximations of powers of $z$ in $z_3$, and compared with those derived from the exact coefficients of (6.17). As expected there was cancellation error from the use of (3.17). From the coefficient of $z^{11}$ onward, the coefficients computed by (3.17) are correct only to seventeen significant decimal accuracy, and the coefficient of $z^{19}$ is correct only to sixteen significant decimal accuracy. The amount of agreement is enough to corroborate that there are no errors in the coefficients shown in (6.17).

If all calculations are performed using double precision, as they are in the present report, then this amount of cancellation error is not serious. Indeed, we shall see that it is of little consequence compared with more serious errors that arise later. The fact that (3.17) can be safely used for the present rectangle gives reason to believe it can be safely used for other rectangles.

By the algorithm embodied in (3.18) through (3.23), approximations for the coefficients of the powers of $z$ in the expansion of $z_5$ were calculated. There was no additional cancellation error.

As before, the errors in the calculated coefficients of $z_5$ are not serious if double precision is used. The agreement suffices to corroborate the coefficients shown in (6.18). Also, one feels that the algorithm will be satisfactory for rectangles of other proportions.

Recalling (6.7), let us identify (6.18) with the right side of (3.24). Recalling also that $z_7$ is the $u(z)$ that we seek (see the end of Section 5), we see that if we define $E_n$ by


(6.19)  $z_5 = \sum_{n=1}^{\infty} E_n z_7^{2n-1},$

then the algorithm embodied in (3.26) and (3.27) will furnish approximations for the $A_n$ (see (3.9) and (3.10)).

To calculate approximations for the $E_n$ in (6.19), we proceed as follows. We define $m$, $K$, and $K'$ by (5.6), (5.9), and (5.10).
By the procedures of Rosser [4], which elaborate some of the techniques given in the NBS Handbook [3], hand calculations were performed, giving the approximations

(6.20)  $\frac{\pi}{2K} \cong 0.65447\ 27107\ 79001\ 74587\ 9748$

(6.21)  $\frac{\pi}{2K'} \cong 0.96156\ 31078\ 83199\ 52969\ 2696.$

Internal checks on the computation were as follows. We defined $a_0$, $b_0$, and $c_0$ by (2.34), (2.38), (2.40), and (2.41) of Rosser [4]. Given $a_n$, $b_n$, and $c_n$, we calculated $a_{n+1}$, $b_{n+1}$, and $c_{n+1}$ by (2.39), (2.43), and (2.88) respectively of Rosser [4]. Then $c_{n+1}$ was checked by

$c_{n+1} = a_n - a_{n+1},$

which follows from (2.44) and (2.39) of Rosser [4], $b_n$ was checked by

$b_n = a_n - 2c_{n+1},$

which follows from (2.44) of Rosser [4], and $a_{n+1}$ was checked by

$(a_{n+1})^2 = (b_{n+1})^2 + (c_{n+1})^2,$

which follows from (2.40) and (2.41) of Rosser [4]. Incidentally, the final relation furnishes at the same time a check of $b_{n+1}$.

The approximations shown in (6.20) and (6.21) were written down from the relation

$\frac{\pi}{2K} \cong a_5$

and the corresponding relation for $K'$ (see (2.49) of Rosser [4]). By (6.20) and (6.21), we get the approximations

(6.22)  $\frac{2K}{\pi} \cong 1.5279\ 47588\ 23439\ 12569\ 75$

(6.23)  $\pi \frac{K'}{K} \cong 2.1382\ 75317\ 86718\ 76977\ 62.$

Then, by (4.4) and (4.8), with $q = q(m)$ and $R = R(q)$, we get

(6.24)  $R \cong 2.8335\ 84629\ 80564\ 51476\ 09$

(6.25)  $q \cong 0.11785\ 79353\ 11857\ 71914\ 15$

(6.26)  $R\sqrt{q} \cong 0.97278\ 21712\ 72939\ 91787.$

As a further check, $2K/\pi$ was recomputed from (4.7). The agreement with (6.22) gave a check of $q$ to 21 significant decimal digits. Then $R\sqrt{q}$ was recomputed from (4.9), thereby checking (6.26) (and indirectly (6.24)), and further confirming the approximation for $q$.
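The quantities $\pi/(2K)$ and $\pi/(2K')$ can be cross-checked in ordinary double precision. The sketch below does not reproduce the hand procedure of Rosser [4]; it uses the standard arithmetic-geometric mean identity $K(m) = \pi / (2\,\mathrm{agm}(1, \sqrt{1-m}))$, with the parameter taken as $m = (2+\sqrt{2})/4$. The function name `agm` and the fixed iteration count are illustrative assumptions.

```python
import math

def agm(a, b, iterations=20):
    """Arithmetic-geometric mean of a and b (quadratically convergent)."""
    for _ in range(iterations):
        a, b = (a + b) / 2.0, math.sqrt(a * b)
    return a

m = (2.0 + math.sqrt(2.0)) / 4.0
pi_over_2K = agm(1.0, math.sqrt(1.0 - m))        # pi / (2K)
pi_over_2Kp = agm(1.0, math.sqrt(m))             # pi / (2K')
ratio = math.pi * pi_over_2K / pi_over_2Kp       # pi K'/K
q = math.exp(-ratio)                             # nome q(m)
```

A fixed number of iterations is used instead of a convergence test so that the loop cannot stall on a last-bit oscillation; twenty iterations is far more than double precision requires.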
By (6.19), (4.14), (5.8), and (5.11), we have

(6.27)  $E_{n+1} = (-1)^n R\, A_n(q) \left(\frac{\pi}{2}\right)^{2n+1}.$
A double precision calculation gave the approximations

$E_{1} \cong 2.4000\ 94459\ 13370\ 29880$
$E_{2} \cong -4.2710\ 91276\ 71015\ 87231$
$E_{3} \cong 9.0780\ 50148\ 84322\ 90154$
$E_{4} \cong -19.589\ 47213\ 73987\ 47225$
$E_{5} \cong 42.271\ 31796\ 55088\ 65669$
$E_{6} \cong -91.247\ 90687\ 42185\ 21666$
$E_{7} \cong 196.96\ 71903\ 29320\ 44722$
$E_{8} \cong -425.17\ 36249\ 74564\ 47812$
$E_{9} \cong 917.78\ 04737\ 90827\ 92546$
$E_{10} \cong -1981.1\ 22411\ 37532\ 06857$
$E_{11} \cong 4276.4\ 54055\ 91231\ 19779$
$E_{12} \cong -9231.1\ 60670\ 91645\ 55305$
$E_{13} \cong 19926.\ 39841\ 97583\ 22048$
$E_{14} \cong -43013.\ 15599\ 83815\ 48733$
$E_{15} \cong 92848.\ 26841\ 09464\ 34459$
$E_{16} \cong -200422.4230\ 19509\ 75164$
$E_{17} \cong 432632.1679\ 06704\ 27497$
$E_{18} \cong -933880.50\ 4\ 34001\ 30625$
$E_{19} \cong 2015875.965\ 28174\ 33942$
$E_{20} \cong -4351473.133\ 35277\ 44512$

Approximations for the coefficients $E_n$ were also calculated, using double precision, by the algorithm embodied in (4.15) through (4.29). There was agreement between the two sets of approximations to seventeen significant decimal accuracy.

In preparation for the use of (3.26) and (3.27) we substitute (6.7) into (6.18) to get

(6.28)  $z_5 = \sqrt{z}\left(B_0^{1/2} + \sum_{n=1}^{\infty} (B_0^{1/2} D_n) z^n\right).$

From the succession of transformations in Section 5, it is clear that $z_5$ is infinite when $z = -1$. Indeed, $z_5$ has a pole at $z = -1$, so that

$(-1)^n B_0^{1/2} D_n$

should approach a limit as $n$ goes to infinity. From our calculated approximations, it appears that this limit is approximately

1.52552.


So, by (3.27), we see that $A_{N+1}$ is calculated by subtracting a sum from a number approximately equal to 1.5 and then dividing by $E_1$, which is approximately 2.4. As $|A_N|$ is quite small for the larger values of $N$, there will have to be quite severe cancellation errors. In addition, there will be rapid growth of round off errors as we progress to larger $N$. In (3.27) the coefficient of $A_N$ on the right is

$-3E_2 A_1^2 / E_1.$

As we shall see shortly,

$A_1 \cong 0.80232.$

Thus, to compute $A_{N+1}$, we multiply $A_N$ by approximately 3.4. Thus we multiply the error of $A_N$ by a similar amount. So it appears that in going from $A_N$ to $A_{N+2}$ we can expect to lose about one decimal digit of accuracy on the right due to round off error.

That is, we cannot expect to have more than 10 decimal digits correct to the right of the decimal point.

With this in mind, we write down the approximations we obtained for the $A_n$. We have dropped off the final digits that are almost certainly wrong. Perhaps the last digit or two that we list is in error, but we rather doubt if any more are.

$A_{1} \cong 0.80232\ 49074\ 90468\ 84047$
$A_{2} \cong 0.17531\ 18403\ 90175\ 83405$
$A_{3} \cong 0.03447\ 58301\ 58893\ 6187$
$A_{4} \cong -0.01614\ 24305\ 19396\ 2687$
$A_{5} \cong 0.00288\ 05454\ 34045\ 715$
$A_{6} \cong 0.00066\ 21097\ 71841\ 473$
$A_{7} \cong 0.00055\ 08746\ 89018\ 36$
$A_{8} \cong -0.00017\ 38659\ 89051\ 07$
$A_{9} \cong 0.00006\ 72097\ 56853\ 1$
$A_{10} \cong 0.00003\ 07687\ 48965$
$A_{11} \cong 0.00001\ 46046\ 03348$
$A_{12} \cong -0.00000\ 63682\ 27833$
$A_{13} \cong 0.00000\ 24412\ 9222$
$A_{14} \cong 0.00000\ 10619\ 3096$
$A_{15} \cong 0.00000\ 05430\ 244$
$A_{16} \cong -0.00000\ 02400\ 927$
$A_{17} \cong 0.00000\ 01010\ 80$
$A_{18} \cong 0.00000\ 00463\ 34$
$A_{19} \cong 0.00000\ 00230\ 7$
$A_{20} \cong -0.00000\ 00105\ 9.$

If $z_7$ is bounded for $|z| = 2$, as it most likely is, then by the Cauchy integral theorem there is a constant $k$ such that for $1 \le n$

$|A_n| \le k / 2^n.$


At first, the $A_n$ drop off much more rapidly than this, but it appears from the tabulated approximations that, from about $A_9$ onwards, the value


does not change greatly with $n$, and does appear to be tending towards 2. This hints at a tantalizingly uniform behavior of the $A_n$.

In this connection one is struck by the fact that, as far as our tabulation goes, every fourth $A_n$ is negative.

For  z = 1,  the  sum  of  the  series  should  be  unity.  The  sum  of  the 
coefficients  shown  is 

0.99999  99924 . 
If, in accordance with the "uniformity" we observed above, we guess

$A_{21} \cong 50 \times 10^{-10}$
$A_{22} \cong 20 \times 10^{-10}$
$A_{23} \cong 8 \times 10^{-10}$
$A_{24} \cong -3 \times 10^{-10}$
$A_{25} \cong 1 \times 10^{-10}$

we will get the desired sum of unity.
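The arithmetic behind this check is easy to repeat. The sketch below adds the twenty tabulated coefficients with Python's `decimal` module, so that no binary rounding intrudes, and then adds the five guessed values. The digit strings are transcribed from the table of the $A_n$ with the spacing removed; the variable names are illustrative.

```python
from decimal import Decimal

A = [Decimal(x) for x in (
    "0.80232490749046884047", "0.17531184039017583405",
    "0.0344758301588936187", "-0.0161424305193962687",
    "0.002880545434045715", "0.000662109771841473",
    "0.00055087468901836", "-0.00017386598905107",
    "0.0000672097568531", "0.000030768748965",
    "0.000014604603348", "-0.000006368227833",
    "0.00000244129222", "0.00000106193096",
    "0.0000005430244", "-0.0000002400927",
    "0.000000101080", "0.000000046334",
    "0.00000002307", "-0.00000001059",
)]
total = sum(A)                                   # sum of A_1 ... A_20

# Guessed values of A_21 ... A_25, in units of 10^-10.
guessed_tail = sum(Decimal(k) * Decimal("1e-10") for k in (50, 20, 8, -3, 1))
completed = total + guessed_tail
```

The guessed values add up to $76 \times 10^{-10}$, which is the amount by which the tabulated sum falls short of unity.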

In Whiteman and Papamichael [1], in Tables 1 and 2, are listed values of Re $u(z)$ for a variety of values of $z$. The series was summed for these values of $z$, and the real part was compared with the eight place values from which the entries in Tables 1 and 2 of Whiteman and Papamichael [1] were rounded. The agreement was perfect, except for values of $z$ with $|z| \ge 1$, for which the absence of the coefficients $A_{21}, A_{22}, \ldots$ led to minor discrepancies.

7. Other boundary conditions.

In Whiteman and Papamichael [1], a conformal map from the rectangle in Fig. 1 of Section 5 to the final rectangle in the $z_7$-plane (see Section 5) was derived, essentially as we have explained it in Section 5. Our coefficients $A_n$ are merely the coefficients of the power series expansion of this transformation. Given a point in the $z$-plane, one can find the corresponding point in the $z_7$-plane either by the methods of Whiteman and Papamichael [1], or by using our approximations for the coefficients to compute an approximate sum for the series. We have given enough coefficients that one can get high accuracy except near the two upper corners. It would be a straightforward calculation to get more coefficients, so that one can get high accuracy everywhere in the rectangle. One would have to use higher precision than double precision, but most large computing centers have software capable of doing this.

The point of the conformal transformation is that boundary conditions which are difficult to satisfy in the rectangle in Fig. 1 may be transformed into boundary conditions which are easy to satisfy in the rectangle in the $z_7$-plane. Thus the difficult boundary conditions (1.2) through (1.6) are transformed into entirely trivial boundary conditions for the rectangle in the $z_7$-plane.

If one should modify the Dirichlet conditions (1.2) and (1.6), one would have in the $z_7$-plane Dirichlet conditions on two ends, and zero normal derivatives on the other two sides. Such problems are fairly routine. If one should modify the condition (1.3) to

(7.1)  $u_y(x,0) = f(x), \qquad 0 < x < 1,$

one would still have for the rectangle in the $z_7$-plane Dirichlet conditions on two ends and Neumann conditions on the other two sides. There are familiar techniques which are quite adequate to deal with this, except for one difficulty. That is to determine what is the Neumann condition in the $z_7$-plane corresponding to (7.1). We now have a way to deal with this.


As we noted in Section 1, $u(x,y)$ is the real part of an analytic function $u(z)$. But, by our transformations, $z$ is an analytic function of $z_7$. Thus, if we define

(7.2)  $u_7(z_7) = u(z),$

then $u_7(z_7)$ is an analytic function of $z_7$. Let us put

(7.3)  $z_7 = \xi + i\eta.$

Then the real part of $u_7$ will be a harmonic function of $\xi$ and $\eta$. We denote this by

(7.4)  $u_7(\xi,\eta).$

We wish to find the condition corresponding to (7.1).

As $z_7$ is a function of $z$, we have by (1.7) and (7.3) that $\xi$ and $\eta$ are functions of $x$ and $y$. Indeed, we have

(7.5)  $u_7(\xi,\eta) = u(x,y).$

So

(7.6)  $\frac{\partial}{\partial y} u(x,y) = \frac{\partial}{\partial \xi} u_7(\xi,\eta)\, \frac{\partial \xi}{\partial y} + \frac{\partial}{\partial \eta} u_7(\xi,\eta)\, \frac{\partial \eta}{\partial y}.$

For $0 < x < 1$ and $y = 0$, we have $0 < \xi < 1$ and $\eta = 0$. So

$\frac{\partial \eta}{\partial x} = 0.$

By the Cauchy-Riemann differential equations, this gives

$\frac{\partial \xi}{\partial y} = 0.$

Then, by the Cauchy-Riemann differential equations, (7.6) reduces to

(7.7)  $\frac{\partial}{\partial y} u(x,0) = \frac{\partial}{\partial \eta} u_7(\xi,0)\, \frac{\partial \xi}{\partial x}.$
a7u(x’0)  = ?n  u7  3x 


We have

(7.8)  $z_7 = \sum_{n=1}^{\infty} A_n z^{n-1/2}.$

But the $A_n$ are all real, so that for $y = 0$ and $0 < x < 1$ we have

(7.9)  $\frac{\partial \xi}{\partial x} = \sum_{n=1}^{\infty} (n - \tfrac{1}{2})\, A_n\, x^{n-3/2}.$

Then, by (7.7) and (7.1), we can determine the values of $\partial u_7(\xi,\eta)/\partial \eta$ for $\eta = 0$.

The series on the right of (7.9) is more slowly converging than the one on the right of (7.8). However, in (7.1) we have $0 < x < 1$, and one can probably get adequate accuracy by means of the coefficients which we have given.

Along the left side and top of the rectangle in Fig. 1, the treatment we have given for (7.6) and (7.7) would be replaced by something more complex. Also, we would replace (7.9) by a more involved relation. Worse than that, one would get into parts of the boundary where the convergence would be fairly slow, so that one would have to determine more coefficients.

As noted above, more coefficients can be determined, but it is laborious. So we seek a better procedure. What we shall do is to specify an auxiliary function $u^*(x,y)$ which satisfies the desired Neumann conditions along the left side and top of the rectangle in Fig. 1. Also $u^*(x,y)$ and its partial derivatives can be calculated at points $(x,y)$ inside or on the boundary of the rectangle. Then $u(x,y) - u^*(x,y)$ has zero normal derivatives along the left side and top of the rectangle. Also, we can determine the conditions for $u(x,y) - u^*(x,y)$ and its normal derivatives along the rest of the boundary. Thus, we can determine $u(x,y) - u^*(x,y)$ by the method indicated earlier in the section. Then $u(x,y)$ can be determined, since we can calculate $u^*(x,y)$.

By rotating and translating the rectangle in Fig. 1, we can restrict attention to the case in which the rectangle lies in the first quadrant, with its lower left hand corner at the origin, and we wish to choose $u^*(x,y)$ to satisfy stated Neumann conditions along the x-axis and y-axis. Nothing is specified as to what $u^*(x,y)$ shall do on the other two sides of the rectangle. More than that, it suffices to consider the special case in which the normal derivative is zero along the imaginary axis; by reflecting about the 45° line $y = x$, we can interchange the x-axis and y-axis.

So let the rectangle have parts of the x-axis and y-axis as its bottom and left side, and let it extend from 0 to $A$ along the x-axis. Let us find $u^*(x,y)$ which is harmonic inside the rectangle, such that
(7.10)  $u^*_x(0,y) = 0, \qquad 0 < y$

(7.11)  $u^*_y(x,0) = g(x), \qquad 0 < x < A.$

We shall assume that $g(x)$ is of bounded variation for $0 \le x < A$. We shall show that

(7.12)  $u^*(x,y) = \frac{1}{2\pi} \int_0^A g(t) \log\{(x-t)^2 + y^2\}\, dt + \frac{1}{2\pi} \int_0^A g(t) \log\{(x+t)^2 + y^2\}\, dt$

fulfills the stated requirements. Since $g(t)$ is of bounded variation, we can form partial derivatives by differentiating under the integral sign. So it is easily seen that $u^*(x,y)$ is harmonic.

Also

$u^*_x(x,y) = \frac{1}{\pi} \int_0^A \frac{g(t)(x-t)\, dt}{(x-t)^2 + y^2} + \frac{1}{\pi} \int_0^A \frac{g(t)(x+t)\, dt}{(x+t)^2 + y^2}.$

So (7.10) is satisfied. We have

(7.13)  $u^*_y(x,y) = \frac{1}{\pi} \int_0^A \frac{g(t)\, y\, dt}{(x-t)^2 + y^2} + \frac{1}{\pi} \int_0^A \frac{g(t)\, y\, dt}{(x+t)^2 + y^2}.$

Clearly the second integral on the right approaches zero as $y \to 0$.

As  g(t)  is  of  bounded  variation,  it  has  limits  g(x-0)  and 
g(x  + 0)  as  t approaches  x from  the  left  and  from  the  right 
Specifically, given any $\delta > 0$, we will show that for all $y$ sufficiently close to 0, the quantity in question is less than $\delta$ in absolute value. Given $\delta > 0$ choose $\varepsilon > 0$ sufficiently small so that

$|g(t) - g(x-0)| < \delta$

for $x - \varepsilon \le t < x$. Then for $0 < y$

$\left|\frac{1}{\pi} \int_{x-\varepsilon}^{x} \frac{\{g(t) - g(x-0)\}\, y\, dt}{(x-t)^2 + y^2}\right| < \frac{\delta}{\pi} \int_{x-\varepsilon}^{x} \frac{y\, dt}{(x-t)^2 + y^2} = \frac{\delta}{\pi} \arctan\frac{\varepsilon}{y} < \frac{\delta}{2}.$

Choose $M$ so that

$|g(t) - g(x-0)| < M$

for $0 \le t \le A$. Then for $0 < y$

$\left|\frac{1}{\pi} \int_{0}^{x-\varepsilon} \frac{\{g(t) - g(x-0)\}\, y\, dt}{(x-t)^2 + y^2}\right| < \frac{My}{\pi} \int_{0}^{x-\varepsilon} \frac{dt}{(x-t)^2 + y^2} < \frac{My}{\pi} \int_{0}^{x-\varepsilon} \frac{dt}{(x-t)^2} < \frac{My}{\pi\varepsilon}.$

For $0 < y < (\pi\delta\varepsilon)/(2M)$, we have

$\left|\frac{1}{\pi} \int_{0}^{x-\varepsilon} \frac{\{g(t) - g(x-0)\}\, y\, dt}{(x-t)^2 + y^2}\right| < \frac{\delta}{2}.$

In the same way, we show

(7.17)  $\lim_{y \to 0} \frac{1}{\pi} \int_{x}^{A} \frac{\{g(t) - g(x+0)\}\, y\, dt}{(x-t)^2 + y^2} = 0.$

So, by (7.13), (7.14), (7.15), (7.16), and (7.17), we conclude that

$\lim_{y \to 0} u^*_y(x,y) = \tfrac{1}{2}\{g(x-0) + g(x+0)\}.$

If $g(t)$ is continuous at $t = x$, then (7.11) holds.
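The limiting behavior can be observed numerically. The sketch below evaluates the derivative $\partial u^*/\partial y$, written as $\frac{1}{\pi}\int_0^A g(t)\,y\,[\,((x-t)^2+y^2)^{-1} + ((x+t)^2+y^2)^{-1}]\,dt$, by a plain midpoint rule for a small positive $y$. The step function `g`, the interval length `A = 1`, and the function name `u_star_y` are hypothetical choices made only for illustration.

```python
import math

def u_star_y(g, A, x, y, n=20000):
    """Midpoint-rule estimate of the y-derivative of u* at (x, y), y > 0."""
    h = A / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        total += g(t) * y * (1.0 / ((x - t) ** 2 + y ** 2)
                             + 1.0 / ((x + t) ** 2 + y ** 2))
    return total * h / math.pi

def g(t):
    """Hypothetical boundary data with a jump from 1 to 3 at t = 0.5."""
    return 1.0 if t < 0.5 else 3.0

at_smooth_point = u_star_y(g, 1.0, 0.25, 0.01)   # close to g(0.25) = 1
at_jump = u_star_y(g, 1.0, 0.5, 0.01)            # close to (1 + 3)/2 = 2
```

Approaching the jump vertically, as here, gives a value near the mean of the one-sided limits; the step size must stay well below $y$ for the midpoint rule to resolve the kernel.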
If $g(t)$ is discontinuous at $t = x_0$, then $u^*_y(x,y)$ can be made to approach any limit between $g(x_0 - 0)$ and $g(x_0 + 0)$ by approaching the point $(x_0, 0)$ along a suitable direction. However, this would have to be the case for any $u^*(x,y)$ which satisfies (7.11). Thus our $u^*(x,y)$ comes as close to satisfying (7.11) at $x = x_0$ as is possible.

For a given value of $x$ and $y$, one can approximate the $u^*(x,y)$ of (7.12) by numerical quadrature. If $y = 0$, one could avoid singularities by putting

$t = x \pm e^{-s}.$

Thus

$\frac{1}{2\pi} \int_0^A g(t) \log (x-t)^2\, dt = -\frac{1}{\pi} \int_{-\log x}^{\infty} g(x - e^{-s})\, s\, e^{-s}\, ds - \frac{1}{\pi} \int_{-\log (A-x)}^{\infty} g(x + e^{-s})\, s\, e^{-s}\, ds.$

For this, a numerical quadrature would be quite satisfactory; one would carry the integration out to some large value of $s$, rather than to infinity, of course.
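The effect of the substitution can be tried on a case with a known answer. The sketch below applies a midpoint rule in the variable $s$ to the two integrals obtained from $t = x \pm e^{-s}$, truncating at a finite `s_max`. For $g \equiv 1$ the exact value is $\frac{1}{\pi}[x\log x - x + (A-x)\log(A-x) - (A-x)]$. The function name `log_integral` and the default values of `s_max` and `n` are assumptions for illustration.

```python
import math

def log_integral(g, A, x, s_max=40.0, n=4000):
    """Estimate (1/(2 pi)) * integral_0^A g(t) log((x - t)^2) dt for 0 < x < A."""
    def piece(sign, s0):
        # Midpoint rule on [s0, s_max] for g(x + sign * exp(-s)) * s * exp(-s).
        h = (s_max - s0) / n
        acc = 0.0
        for k in range(n):
            s = s0 + (k + 0.5) * h
            acc += g(x + sign * math.exp(-s)) * s * math.exp(-s)
        return acc * h
    left = piece(-1.0, -math.log(x))          # covers 0 < t < x
    right = piece(+1.0, -math.log(A - x))     # covers x < t < A
    return -(left + right) / math.pi

A, x = 1.0, 0.3
approx = log_integral(lambda t: 1.0, A, x)
exact = (x * math.log(x) - x + (A - x) * math.log(A - x) - (A - x)) / math.pi
```

After the substitution the integrand is smooth and decays like $s\,e^{-s}$, so a modest cutoff suffices and the logarithmic singularity at $t = x$ never has to be sampled.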

If one should wish values of $u^*(x,y)$ at a great many points, a reasonable procedure would be to approximate values of $u^*(x,y)$ on a grid by numerical quadrature, and then to interpolate by the spline function interpolation method of Papamichael and Whiteman [5].

For either of the integrals appearing on the right of (7.12), numerical quadrature will be quite unsatisfactory near $x = y = 0$. However, the irregularities in the two integrals will mostly cancel out if we combine the two integrals into a single integral. Define

$$g^{**}(t) = \begin{cases} g(t), & 0 < t < A \\ g(-t), & -A < t < 0 \end{cases} \tag{7.18}$$

and then (7.12) takes the form

$$u^*(x,y) = \frac{1}{2\pi} \int_{-A}^{A} g^{**}(t) \log\{(x-t)^2 + y^2\}\, dt . \tag{7.19}$$


In this, numerical quadrature is still unsatisfactory near $x = A$, $y = 0$. To improve this, define $g^*(t)$ to satisfy

$$g^*(t) = g^*(-t) \tag{7.20}$$

$$g^*(t) = g^{**}(t), \qquad -A < t < A \tag{7.21}$$

with $g^*(t)$ as smooth as possible for $A < t < B$, where $B > A$; if $g(t)$ has a left derivative at $t = A$, we can (and should) choose $g^*(t)$ to have a continuous derivative for $A \le t \le B$. Then in place of (7.12), we can define

$$u^*(x,y) = \frac{1}{2\pi} \int_{-B}^{B} g^*(t) \log\{(x-t)^2 + y^2\}\, dt . \tag{7.22}$$


This will be a slightly different function from that defined by (7.12), but we can use the same argument as we gave for (7.12) to show that

$$u^*_x(0,y) = 0, \qquad 0 < y$$

and

$$\lim_{y \to 0} u^*_y(x,y) = \tfrac{1}{2}\{g(x-0) + g(x+0)\}$$

holds for $0 < x < A$; at $x = 0$ and $x = A$ the limits will be $g(0+)$ and $g(A-0)$ respectively.

If $g^*(t)$ is discontinuous at $t = b$, with $0 < b < A$, then numerical quadrature of the right side of (7.22) will be very poor near $x = b$, $y = 0$. It would likely be worthwhile to use the techniques of Section 7 of Rosser [6] to "remove" discontinuities of $g^*(t)$ before defining $u^*(x,y)$. Somewhat the same is true of discontinuities of the derivatives of $g^*(t)$. Thus, if $g(t)$ has a right-hand derivative at $t = 0$ which is not 0, then the derivative of $g^*(t)$ will have a discontinuity at $t = 0$ which can be "removed" by the techniques of Section 7 of Rosser [6]. This "removal" would improve the numerical quadrature of the right side of (7.22). Certainly, the smoother $g^*(t)$ is, the better will be the numerical quadrature of the right side of (7.22).
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A FOURIER- SOLUTION  OF  PARABOLIC  PDE  BY  TAYLOR  SERIES 


Y.  F.  Chang 

Computer  Science  Department 
University  of  Nebraska, 
Lincoln,  Nebraska  68588 


ABSTRACT. Two-dimensional Taylor series are used to solve initial-boundary value problems in parabolic partial differential equations. Earlier works by the author dealt with basic concepts in the solutions of parabolic PDE's by Taylor series. The initial and boundary conditions were then restricted to be compatible (i.e. either initial or boundary conditions alone are sufficient to solve the problem). This paper will deal with problems where the initial and boundary conditions are incompatible and even inconsistent. The solution is obtained using two-dimensional Taylor series in a manner analogous to Fourier-series solutions. The following example is solved to illustrate this method of Taylor series.

$$u_{xx} - x^2 u = u_t , \quad \text{satisfying } u(0,t) = u(2,t) = 0, \text{ and } u(x,0) = 1 .$$

The solution proceeds as follows. Using the differential equation as the recursive relation, a family of two-dimensional Taylor series is constructed defining functions which satisfy both the equation and the boundary conditions, and which resemble the functions $\sin(n\pi x/2)$. The final solution is a linear combination of the family of two-dimensional series with coefficients determined by a Fourier-type approximation of the initial condition.

1. INTRODUCTION. The application of Taylor series in the solution of ordinary differential equations has been treated with growing interest in the published literature. For example, see the works of Gibbons(1), Richtmeyer(2), Fehlberg(3), Hartwell(4), Leavitt(5), Barton, et al(6), and Chang(7). At present, there is a need to explore the application of multi-dimensional Taylor series to the solutions of partial differential equations (PDE).


The solution of parabolic PDE's with compatible initial and boundary conditions has been discussed by Chang(8) & (9). A compatible problem is one for which the initial condition (or the boundary conditions) alone is sufficient to uniquely determine the solution to the problem. When the boundary conditions are used to determine the solution, as in Chang(8), it is necessary to use three-dimensional Taylor series, whose terms are constructed recursively from the PDE and the boundary conditions. This is accomplished without iteration. When the initial condition is used to determine the solution, as in Chang(9), it is necessary to use two-dimensional Taylor series, whose terms are constructed recursively from the PDE and the initial condition. This solution also is accomplished without iteration.

In this paper, we will discuss the solution of parabolic PDE with incompatible and even inconsistent initial and boundary conditions. The following sample problem will be solved to illustrate this method of two-dimensional Taylor series solution.

$$u_{xx} - x^2 u = u_t , \quad \text{satisfying } u(0,t) = u(2,t) = 0, \text{ and } u(x,0) = 1 . \tag{1}$$


We will first briefly review some elementary concepts in the Fourier solution of the heat equation.

$$u_{xx} = u_t , \quad \text{satisfying } u(0,t) = u(2,t) = 0, \text{ and } u(x,0) = 1 . \tag{2}$$

Consider the family of functions $f_n(x) = \sin(n\pi x/2)$, with $n = 1,2,3,\ldots$, which satisfy the boundary conditions $f_n(0) = f_n(2) = 0$ for all $n$. Next, consider the functions

$$y_n(x,t) = \exp(-a^2 t) \cdot f_n(x) . \tag{3}$$

The y-functions satisfy the heat equation if $a = n\pi/2$. The solution of Eq.(2) is found as the linear combination of the y-functions,

$$u(x,t) = \sum_{n=1}^{\infty} c_n \cdot y_n(x,t) ,$$

where the coefficients $c_n$ are determined from the initial condition. Before determining $c_n$, it is best to orthonormalize $y_n(x,t)$ in the interval [0,2]. Then, the coefficients are found by the inner product

$$c_n = \int_0^2 y_n(x,0) \cdot u(x,0)\, dx = \int_0^2 f_n(x) \cdot u(x,0)\, dx ,$$

where $u(x,0)$ can be any given initial condition.
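
The Fourier solution of Eq.(2) can be illustrated by a short sketch. It uses the fact that the functions $\sin(n\pi x/2)$ already have unit norm on [0,2], so the inner product with $u(x,0) = 1$ can be written in closed form; the function names and the number of retained terms are illustrative assumptions.

```python
import math

def heat_coefficient(n):
    """c_n = int_0^2 sin(n*pi*x/2) * 1 dx = 2*(1 - cos(n*pi)) / (n*pi)."""
    return 2.0 * (1.0 - math.cos(n * math.pi)) / (n * math.pi)

def heat_solution(x, t, terms=200):
    """Partial sum of u(x,t) = sum_n c_n * exp(-a^2 t) * sin(a x), a = n*pi/2."""
    total = 0.0
    for n in range(1, terms + 1):
        a = n * math.pi / 2.0
        total += heat_coefficient(n) * math.exp(-a * a * t) * math.sin(a * x)
    return total

c1 = heat_coefficient(1)           # 4/pi
c2 = heat_coefficient(2)           # vanishes by symmetry about x = 1
u_mid_start = heat_solution(1.0, 0.0)
u_mid_later = heat_solution(1.0, 0.5)
```

At $t = 0$ the partial sums converge slowly because the initial and boundary conditions are inconsistent at $x = 0$ and $x = 2$; for $t > 0$ the exponential factors make the first term dominant.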

2. DISCUSSION OF TAYLOR SERIES METHOD. The development of the Taylor series method for the solution of parabolic PDE's begins with a detailed examination of the solution functions $y_n(x,t)$, see Eq.(3). We expand $y_n(x,t)$ in a two-dimensional Taylor series about the point x=0 and t=0, with an increment h in x and an increment g in t. The result is shown in Table I, with orders of x placed horizontally and orders of t placed vertically.

There are four important features to be noted concerning Table I. (a) Every odd column in Table I contains only zero elements, so $y_n(x,t)$ satisfies the boundary condition u(0,t) = 0. (b) The sum of all the terms in each individual row is zero for $a = n\pi/2$ and h = 2, so $y_n(x,t)$ satisfies the boundary condition u(2,t) = 0. (c) There are pairs of terms in this 2-D array that satisfy the recursive relation

$$Y(n,m) = \frac{m\,h^2}{g\,(n-1)(n-2)}\; Y(n-2,\,m+1) , \tag{4}$$

where n and m are the orders of the terms in the x and t directions. This recursive relation is derivable from the heat equation. (d) The adjacent rows differ by a constant. For example, rows #1 and #2 differ by the constant $-a^2 g$ and rows #3 and #4 differ by the constant $-a^2 g/3$. This is an important observation that will aid in the solution of the more complex problem.


Table I. The Two-Dimensional Array for y_n(x,t).

t \ x   1   2               3   4                        5   6                         7   8
  1     0   a h             0   -a^3 h^3/3!              0   a^5 h^5/5!                0   -a^7 h^7/7!
  2     0   -a^3 h g        0   a^5 h^3 g/3!             0   -a^7 h^5 g/5!             0   a^9 h^7 g/7!
  3     0   a^5 h g^2/2!    0   -a^7 h^3 g^2/(2! 3!)     0   a^9 h^5 g^2/(2! 5!)       0   -a^11 h^7 g^2/(2! 7!)
  4     0   -a^7 h g^3/3!   0   a^9 h^3 g^3/(3! 3!)      0   -a^11 h^5 g^3/(3! 5!)     0   a^13 h^7 g^3/(3! 7!)
  5     0   a^9 h g^4/4!    0   -a^11 h^3 g^4/(4! 3!)    0   a^13 h^5 g^4/(4! 5!)      0   -a^15 h^7 g^4/(4! 7!)

We will derive Eq.(4) from the heat equation and extend that to the problem in Eq.(1). Consider a reduced derivative defined as

$$Y(n+1,\,m+1) = \frac{\partial^{\,n+m} y}{\partial x^n\, \partial t^m} \cdot \frac{h^n g^m}{n!\; m!} , \qquad n,m = 0,1,2,\ldots \tag{5}$$

where y is a function of x and t, h is the increment in x, and g is the increment in t. (The orders n+1 and m+1 are necessary due to a limitation of FORTRAN.) In this paper, upper-case letters are reserved for the reduced derivatives of the underlying lower-case functions. The reduced derivatives are the terms of the two-dimensional Taylor series calculated at a specific point (0,0) with increments h and g. Given a function f(x,t), the value of f at (0,0) is stored as F(1,1), the value of $\partial^n f/\partial x^n$ at (0,0) is stored in F(n+1,1), and the value of $\partial^m f/\partial t^m$ at (0,0) is stored in F(1,m+1). Graphically, given a 2-D array for f(x,t), differentiation of f with respect to x n-times yields the term n-spaces to the right, and differentiation of f with respect to t m-times yields the term m-spaces down.

Consider the function $w(x,t) = \partial^2 u/\partial x^2$, or in terms of reduced derivatives $W(1,1) = 2! \cdot U(3,1)/h^2$. Differentiation of w with respect to t m-times yields the term m-spaces down from U(3,1) in the 2-D U-array. The result is $W(1,m+1) = 2! \cdot U(3,m+1)/h^2$. However, differentiation of w with respect to x does not yield the term one space to the right of U(3,1) in the U-array. The term W(2,1) is not equal to $2! \cdot U(4,1)/h^2$. This is due to a mismatch in the factorial functions between the W-array and the U-array. Observe that the true relations are

$$W(n,1) = \frac{n(n+1)}{h^2}\, U(n+2,1) , \quad \text{and} \quad W(n,m) = \frac{n(n+1)}{h^2}\, U(n+2,m) . \tag{6}$$

Therefore, differentiation of w with respect to x n-times yields the term n-spaces to the right of U(3,1) multiplied by $n(n+1)/h^2$. The same analysis applies for the function $v(x,t) = \partial u/\partial t$, with the result

$$V(n,m) = \frac{m}{g}\, U(n,\,m+1) . \tag{7}$$

Consider next Leibnitz' Rule for the derivatives of a product $y(x,t) = p(x,t) \cdot q(x,t)$,

$$\frac{1}{n!} \frac{\partial^n y}{\partial x^n} = \sum_{i=0}^{n} \frac{1}{i!} \frac{\partial^i p}{\partial x^i} \cdot \frac{1}{(n-i)!} \frac{\partial^{\,n-i} q}{\partial x^{\,n-i}} .$$

In terms of reduced derivatives, the modified Leibnitz' Rule is

$$Y(n,1) = \sum_{i=1}^{n} P(i,1) \cdot Q(n-i+1,\,1) . \tag{8}$$

The modified Leibnitz' Rule will be very useful in the analysis of the PDE in Eq.(1).
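
Eq.(8) is a discrete convolution of the two rows of reduced derivatives. The sketch below is a hypothetical illustration (the function name and the 0-based Python lists standing in for the 1-based FORTRAN arrays are not from the paper); it checks the rule on a product of two polynomials and on the product $x^2 \cdot u$ expanded about x = 0.

```python
def reduced_product(P, Q):
    """Top-row reduced derivatives of y = p*q from those of p and q.

    P[i-1] and Q[i-1] hold P(i,1) and Q(i,1); both lists must have the
    same length.  Implements Y(n,1) = sum_{i=1..n} P(i,1) * Q(n-i+1,1).
    """
    return [sum(P[i - 1] * Q[n - i] for i in range(1, n + 1))
            for n in range(1, len(P) + 1)]

# (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3, with h = 1.
poly = reduced_product([1, 2, 3, 0], [4, 5, 0, 0])

# p = x^2 expanded about x = 0 with h = 2: only P(3,1) = h^2 = 4 is nonzero,
# so F(n,1) = h^2 * U(n-2,1) for any row U.
U = [1.0, 2.0, 3.0, 4.0, 5.0]
F = reduced_product([0.0, 0.0, 4.0, 0.0, 0.0], U)
```

The second check reproduces the shift by two columns that appears when the rule is applied to the term $x^2 u$.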

We will now derive the recursive relation, see Eq.(4), from the heat equation using two different approaches. Differentiation of the heat equation with respect to x n-times and with respect to t m-times yields the general form

$$\frac{\partial^{\,n+m+2} u}{\partial x^{n+2}\, \partial t^m} = \frac{\partial^{\,n+m+1} u}{\partial x^n\, \partial t^{m+1}} ,$$

which when written in terms of the reduced derivatives becomes

$$\frac{(n+1)(n+2)}{h^2}\, U(n+3,\,m+1) = \frac{m+1}{g}\, U(n+1,\,m+2) .$$

With adjustment for the orders, this is the same recursive relation as that given in Eq.(4).

The second approach begins with the heat equation written in terms of the reduced derivatives

$$\frac{2}{h^2}\, U(3,1) = \frac{1}{g}\, U(1,2) . \tag{9}$$

Recognizing the relations in Eq.(6) and in Eq.(7), the repeated differentiation of the functions represented by the terms in Eq.(9) with respect to x and t yields the recursive relation

$$\frac{n(n+1)}{h^2}\, U(n+2,\,m) = \frac{m}{g}\, U(n,\,m+1) .$$

Except for the orders, this recursive relation is the same as those above.
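
As a check on the recursion for the heat equation, the following sketch fills the 2-D array column by column from its second column and compares the result with the reduced derivatives of $\exp(-a^2 t)\sin(ax)$. The function names, the array sizes, and the sample values of a, h and g are assumptions made for the illustration.

```python
import math

def build_array(a, h, g, n_cols=8, n_rows=5):
    """Y[n][m] with 1-based orders n (in x) and m (in t); index 0 is unused."""
    rows = n_rows + n_cols // 2            # extra rows feed the later columns
    Y = [[0.0] * (rows + 2) for _ in range(n_cols + 1)]
    for m in range(1, rows + 1):           # second column: a*h*(-a^2 g)^(m-1)/(m-1)!
        Y[2][m] = a * h * (-a * a * g) ** (m - 1) / math.factorial(m - 1)
    for n in range(4, n_cols + 1, 2):      # odd columns stay zero
        for m in range(1, rows - (n - 2) // 2 + 1):
            Y[n][m] = m * h * h / (g * (n - 1) * (n - 2)) * Y[n - 2][m + 1]
    return Y

def exact_term(a, h, g, n, m):
    """Reduced derivative of exp(-a^2 t) sin(a x) at (0,0), even n only."""
    k = (n - 2) // 2
    sign = (-1) ** (k + m - 1)
    return (sign * a ** (n - 1 + 2 * (m - 1)) * h ** (n - 1) * g ** (m - 1)
            / (math.factorial(n - 1) * math.factorial(m - 1)))

Y = build_array(1.5, 2.0, 0.1)
worst = max(abs(Y[n][m] - exact_term(1.5, 2.0, 0.1, n, m))
            for n in (2, 4, 6, 8) for m in range(1, 6))

# Feature (b) of Table I: with a = pi/2 and h = 2 the top row sums to sin(pi) = 0.
Z = build_array(math.pi / 2.0, 2.0, 0.1, n_cols=40)
top_row_sum = sum(Z[n][1] for n in range(1, 41))
```

Only the second column has to be supplied; every other even column follows from the recursion, which is the property exploited later for the PDE of Eq.(1).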
Next, we will derive the recursive relation for the parabolic PDE of

$$u_{xx} = x^2 \cdot u + u_t . \tag{10}$$

Repeated differentiation of Eq.(10) with respect to x n-times yields

$$\frac{\partial^{\,n+2} u}{\partial x^{n+2}} = x^2 \frac{\partial^n u}{\partial x^n} + 2nx \frac{\partial^{\,n-1} u}{\partial x^{n-1}} + n(n-1) \frac{\partial^{\,n-2} u}{\partial x^{n-2}} + \frac{\partial^{\,n+1} u}{\partial x^n\, \partial t} . \tag{11}$$

Just as in the case of the heat equation, we will expand the solution function of Eq.(10) about x=0 and t=0. With this condition, Eq.(11) becomes

$$\frac{\partial^{\,n+2} u}{\partial x^{n+2}} = n(n-1) \frac{\partial^{\,n-2} u}{\partial x^{n-2}} + \frac{\partial^{\,n+1} u}{\partial x^n\, \partial t} . \tag{12}$$

Repeated differentiation of Eq.(12) with respect to t m-times yields the general form

$$\frac{\partial^{\,n+m+2} u}{\partial x^{n+2}\, \partial t^m} = n(n-1) \frac{\partial^{\,n+m-2} u}{\partial x^{n-2}\, \partial t^m} + \frac{\partial^{\,n+m+1} u}{\partial x^n\, \partial t^{m+1}} . \tag{13}$$

When written in terms of the reduced derivatives, Eq.(13) becomes the desired recursive relation

$$\frac{(n+1)(n+2)}{h^2}\, U(n+3,\,m+1) = h^2\, U(n-1,\,m+1) + \frac{m+1}{g}\, U(n+1,\,m+2) . \tag{14}$$

We will derive Eq.(14) from a second approach. Written in terms of the reduced derivatives, Eq.(10) becomes

$$\frac{2}{h^2}\, U(3,1) = x^2\, U(1,1) + \frac{1}{g}\, U(1,2) . \tag{15}$$

Applying the modified Leibnitz' Rule to the product $f = x^2 \cdot u$, we find

$$F(n,1) = x^2\, U(n,1) + 2xh \cdot U(n-1,1) + h^2 \cdot U(n-2,1) .$$

With x=0, this becomes

$$F(n,1) = h^2\, U(n-2,1) . \tag{16}$$

Recognizing the relations in Eq.(6) and in Eq.(7), and applying the result in Eq.(16), the repeated differentiation of the functions represented by the terms in Eq.(15) with respect to x n-times and t m-times yields the result

$$\frac{n(n+1)}{h^2}\, U(n+2,\,m) = h^2\, U(n-2,\,m) + \frac{m}{g}\, U(n,\,m+1) . \tag{17}$$

Except for adjustment of orders, Eq.(14) and Eq.(17) are identical. This is the recursive relation that we will use to derive the 2-D array solution function for Eq.(10).


The boundary condition at x=0 is satisfied if the first column in the
2-D  array  has  only  zero  elements.  A simple  examination  of  Eq.(17)  shows  that 
the  recursive  relation  is  between  terms  in  alternate  columns.  Therefore, 
since  the  first  column  contains  only  zero  elements,  all  odd  columns  in  the 
array  will  contain  only  zero  elements.  This  result  is  the  same  as  that  for 
the  heat  equation. 

In the 2-D array for the solution functions of the heat equation, the adjacent rows differ by a constant. We will show that the same is true for the solution functions of Eq.(10). Let the second-column terms be given by

$$Y(2,1) = a \cdot h , \qquad Y(2,2) = k_2\, a \cdot h \cdot g , \qquad \text{and} \qquad Y(2,m) = k_m\, a \cdot h \cdot g^{m-1}/(m-1)! , \tag{18}$$

where $k_m g^{m-1}/(m-1)!$ is the constant multiplier between the higher-order rows and the first row. By Eq.(17), the fourth-column terms are

$$Y(4,1) = k_2\, a \cdot h^3/6 , \qquad Y(4,2) = k_3\, a \cdot h^3 g/6 , \qquad Y(4,m) = k_{m+1}\, \frac{a \cdot h^3 g^{m-1}}{6\,(m-1)!} .$$

If the adjacent rows are to be different by a constant, the following must be true:

$$\frac{Y(2,m)}{Y(2,1)} = \frac{k_m\, g^{m-1}}{(m-1)!} = \frac{Y(4,m)}{Y(4,1)} = \frac{k_{m+1}}{k_2} \cdot \frac{g^{m-1}}{(m-1)!} .$$

This implies that $k_m = k_{m+1}/k_2$, or $k_m = k_2^{\,m-1}$. This relation can also be derived from the comparison of the terms between any pair of even-order columns.

The series in t given in Eq.(18) can be written as

$$Y(2,m) = (k_2 g)^{m-1}\, a \cdot h/(m-1)! .$$

This is the series for an exponential function in t, specifically

$$\left. \frac{\partial y}{\partial x} \right|_{x=0} = \exp(k_2 t) \cdot a . \tag{19}$$

The statement equivalent to Eq.(19) for the heat equation is

$$\left. \frac{\partial y}{\partial x} \right|_{x=0} = \exp(-a^2 t) \cdot a ,$$

so we identify $k_2 = -a^2$. This is not a necessary identification since "a" is a constant whose value is yet to be determined. It is merely convenient to merge the two constants into one; we will be normalizing the solution functions later.

Now, we will consider the boundary condition u(2,t) = 0. First, we rewrite Eq.(17) for just the top-row terms;

$$\frac{(n-1)(n-2)}{h^2}\, U(n,1) = h^2\, U(n-4,1) + \frac{1}{g}\, U(n-2,2) . \tag{20}$$

The last term on the right belongs in the second row, which in turn depends on a term in the third row, etc. Here is the reason why it is so important to show that the rows differ by a constant. The multiplier between the top row and the second row is $k_2 g = -a^2 g$; therefore, Eq.(20) becomes

$$\frac{(n-1)(n-2)}{h^2}\, U(n,1) = h^2\, U(n-4,1) - a^2\, U(n-2,1) .$$

This equation has only terms in the top row; therefore, the function therein is not a function of t, and we can write

$$\frac{(n-1)(n-2)}{h^2}\, \Phi(n) = h^2\, \Phi(n-4) - a^2\, \Phi(n-2) .$$

The solution function of the problem of Eq.(1) is then of the form

$$y_n(x,t) = \exp(-a^2 t) \cdot \phi_n(ax) . \tag{21}$$

Since the rows in the 2-D array of the solution function differ only by constants, it is only necessary to match a single row to the boundary condition u(2,0) = 0. Just as in the solution of the heat equation, where sin(2a) = 0, we now have that $\phi_n(2a) = 0$. The function $\phi_n(ax)$ with terms up to the 7-th order, as obtained from the top-row recursion above, is as follows

$$\phi_n(ax) = ax - \frac{a^3 x^3}{6} + \frac{a x^5}{20} + \frac{a^5 x^5}{120} - \frac{13 a^3 x^7}{2520} - \frac{a^7 x^7}{5040} + \ldots$$

For n = 21, it is necessary to calculate up to terms of the 209-th order for convergence of the Taylor series. The boundary condition u(2,0) = 0 means that $\phi_n(2a) = 0$. Therefore, the task is to solve for all the values of "a" which satisfy the above condition. The values of "a" found by Newton iteration are listed in Table II.
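
The search for the first constant can be sketched in ordinary double precision. The code below is only an illustration under stated assumptions: it builds the series from the top-row recursion with h absorbed into x, so that the coefficient $c_k$ of $x^k$ obeys $k(k-1)\,c_k = c_{k-4} - a^2 c_{k-2}$ with $c_1 = a$, and it replaces the Newton iteration of the paper by bisection on an assumed bracket for robustness. The function names, bracket and term count are hypothetical.

```python
def phi(a, x, terms=80):
    """Taylor series of the x-dependent factor, from the top-row recursion."""
    c = [0.0] * (terms + 1)
    c[1] = a
    total = c[1] * x
    for k in range(3, terms + 1, 2):       # even powers vanish (odd columns are zero)
        lower = c[k - 4] if k >= 5 else 0.0
        c[k] = (lower - a * a * c[k - 2]) / (k * (k - 1))
        total += c[k] * x ** k
    return total

def find_constant(lo, hi, tol=1e-13):
    """Bisection for a root of phi(a, 2) = 0 inside the bracket [lo, hi]."""
    f_lo = phi(lo, 2.0)
    if f_lo * phi(hi, 2.0) > 0.0:
        raise ValueError("bracket does not enclose a sign change")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = phi(mid, 2.0)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

a1 = find_constant(1.5, 2.5)   # to be compared with the n = 1 entry of Table II
```

Double precision is adequate for the lowest constants only; the higher ones involve heavy cancellation in the series, which is the reason the paper turns to extended-precision arithmetic.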

These values of "a" are accurate to the digits given in Table II. They are calculated using extended-precision arithmetic with accuracy limited only by the storage capacity of the computer. The necessity of using extended-precision arithmetic comes about from the need for more accuracy than what is available from double-precision on the CDC-6400 computer. The extended-precision arithmetic program is written in FORTRAN using some of the ideas given by Knuth(10). We have found that it is not advisable to perform binary operations when the language is FORTRAN. Therefore, we used integer numbers with a base as large as possible; for the CDC-6400, this is a power of 10. The extended-precision arithmetic operations are performed on numbers stored in arrays with each location interpreted as one digit in that base. All operations are integer in nature and similar to primary-school arithmetic. For the sake of speed, we use the first location in each array as the sign of the number, and we search each array for the significant digits and only perform operations on non-zero digits. Except for the sign in the first location of the arrays, positive and negative numbers look exactly alike. Extended-precision arithmetic execution time on the CDC-6400 is about 12-times longer than the execution time for double-precision. This comparison is made using numbers that can be handled by double-precision. Once the numbers exceed the limit of double-precision, there can be no meaningful comparison. A listing of the FORTRAN subroutine for extended-precision arithmetic is available from the author.
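
The flavor of such digit-array arithmetic can be conveyed by a small sketch. It is not the author's FORTRAN routine: the base of 10^4, the least-significant-digit-first storage, the function names and the omission of the sign location are all assumptions made to keep the example short.

```python
BASE = 10 ** 4   # assumed base; the paper uses the largest power of 10 that fits

def to_digits(value):
    """Split a non-negative integer into base-BASE digits, least significant first."""
    digits = []
    while value:
        value, rem = divmod(value, BASE)
        digits.append(rem)
    return digits or [0]

def from_digits(digits):
    """Inverse of to_digits, used here only for checking."""
    return sum(d * BASE ** k for k, d in enumerate(digits))

def multiply(x, y):
    """Primary-school multiplication of two digit arrays (magnitudes only)."""
    out = [0] * (len(x) + len(y))
    for i, xd in enumerate(x):
        if xd == 0:                 # skip zero digits, as the paper describes
            continue
        carry = 0
        for j, yd in enumerate(y):
            carry, out[i + j] = divmod(out[i + j] + xd * yd + carry, BASE)
        out[i + len(y)] += carry
    while len(out) > 1 and out[-1] == 0:
        out.pop()                   # drop leading zero digits
    return out

product = from_digits(multiply(to_digits(123456789), to_digits(987654321)))
```

Every intermediate quantity stays below BASE squared plus a small carry, which is why the base must be chosen so that such products fit in one machine integer.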


The next step in the solution of parabolic PDE's is the orthonormalization of the $\phi_n(ax)$ functions. The orthonormalization of the Taylor series is performed using the Gram-Schmidt method, which has been studied in Chang and Colton(11). Using extended-precision arithmetic for ultimate accuracy, we have found that the $\phi_n$-functions are orthogonal in the interval [0,2]. So, only normalization is required. The normalization process adjusts for the fact that $k_2$ is not exactly equal to $(-a^2)$.


Let us summarize the steps in the derivation for the solution function, see Eq.(21).

$$y_n(x,t) = \exp(-a^2 t) \cdot \phi_n(ax)$$

The time-dependent portion of the solution function is $\exp(-a^2 t)$ since the rows of the 2-D array differ by constants, see Eq.(19). We have also shown that $\phi_n(ax)$ is a function of x only, see the discussion following Eq.(20). Then, we have constructed the Taylor series terms for $\phi_n(ax)$ and solved for the values of "a" which satisfy u(0,t) = u(2,t) = 0, see Table II. The $\phi_n$-functions form an orthogonal set on the interval [0,2]. And we have normalized them to form an orthonormal set.


The remaining task in the solution of Eq.(1) is to find the Fourier-type approximation for the initial condition using the $\phi_n$-functions. Just as in the heat equation solution, we will find the final solution as the linear combination of $y_n(x,t)$.

$$u(x,t) = \sum_{n=1}^{\infty} c_n \cdot y_n(x,t) \tag{22}$$

where the coefficients $c_n$ are found by the inner product

$$c_n = \int_0^2 \phi_n(x) \cdot u(x,0)\, dx .$$

In the next section, we will apply this Taylor series method to three sample problems.
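
The normalization and the inner product can be sketched for the two lowest functions. As before this is a double-precision illustration under assumptions, not the procedure of the paper: the series is built from the top-row recursion with h absorbed into x, the constants are located by bisection on assumed brackets, and the integrals are evaluated with a composite Simpson rule. All names are hypothetical.

```python
def phi(a, x, terms=80):
    """Series with c_1 = a and k(k-1) c_k = c_{k-4} - a^2 c_{k-2}."""
    c = [0.0] * (terms + 1)
    c[1] = a
    total = a * x
    for k in range(3, terms + 1, 2):
        lower = c[k - 4] if k >= 5 else 0.0
        c[k] = (lower - a * a * c[k - 2]) / (k * (k - 1))
        total += c[k] * x ** k
    return total

def find_constant(lo, hi, tol=1e-13):
    """Bisection for phi(a, 2) = 0; the bracket must enclose one sign change."""
    f_lo = phi(lo, 2.0)
    if f_lo * phi(hi, 2.0) > 0.0:
        raise ValueError("bracket does not enclose a sign change")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = phi(mid, 2.0)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def normalized(a):
    """Return the function phi(a, .) scaled to unit norm on [0, 2]."""
    norm = simpson(lambda x: phi(a, x) ** 2, 0.0, 2.0) ** 0.5
    return lambda x: phi(a, x) / norm

f1 = normalized(find_constant(1.5, 2.5))
f2 = normalized(find_constant(2.5, 4.0))
overlap = simpson(lambda x: f1(x) * f2(x), 0.0, 2.0)   # orthogonality check
c1 = simpson(lambda x: f1(x) * 1.0, 0.0, 2.0)          # coefficient for u(x,0) = 1
```

The computed `c1` may be compared with the first entry of Table V; the vanishing `overlap` illustrates why only normalization, and no further Gram-Schmidt step, is needed.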
Table II. The Functional Constants for phi_n(2a) = 0.

  n     Value of "a"
  1      1.87873 17204 86263
  2      3.34204 70011 74951
  3      4.85076 94487 09503
  4      6.38803 89296 70844
  5      7.93823 64977 79745
  6      9.49515 38510 55045
  7     11.05597 96202 17529
  8     12.61927 20295 99243
  9     14.18421 87256 09143
 10     15.75032 80128 90541
 11     17.31728 51600 97127
 12     18.88487 94810 13789
 13     20.45296 46754 95658
 14     22.02143 60050 07426
 15     23.59021 65254 73077
 16     25.15924 84480 76606
 17     26.72848 75315 53938
 18     28.29789 93343 11262
 19     29.86745 66448 47917
 20     31.43713 76800 28297
 21     33.00692 47963 29488
3. NUMERICAL EXAMPLES. The three sample problems to be solved differ only in their initial conditions. This permits us to use the same $\phi_n$-functions (which are dependent only on the PDE and the boundary conditions) for the solution of increasingly more difficult problems leading up to the problem of Eq.(1).

EXAMPLE 1: $u_{xx} - x^2 \cdot u = u_t$, satisfying u(0,t) = u(2,t) = 0, and $u(x,0) = 2x - x^2$, in the interval [0,2]. The coefficients $c_n$ calculated for this problem are listed in Table III. Only 13 coefficients are listed because sufficient accuracy is obtained at that point. Inclusion of higher-order solution functions would tend to degrade the accuracy, because we did not use extended-precision in this part of the solution. The error given in Table III is the difference between the given initial condition, $u(x,0) = 2x - x^2$, and the solution at the point specified. This is the largest error found for each of the approximate solutions.


Table III. Coefficients c_n for Example 1.

  n      c_n        error     at x
  1     1.02735     0.33      1.9
  2    -0.09380     0.19      1.9
  3     0.04810     0.075     1.9
  4    -0.00335     0.065     1.9
  5     0.00900     0.032     1.9
  6    -0.00042     0.030     1.9
  7     0.00314     0.015     1.9
  8    -0.00010     0.015     1.9
  9     0.00145     0.0072    1.9
 10    -0.00003     0.0070    1.9
 11     0.00079     0.0029    1.9
 12    -0.00001     0.0028    1.9
 13     0.00048     0.0010    1.8

EXAMPLE 2: $u_{xx} - x^2 \cdot u = u_t$, satisfying u(0,t) = u(2,t) = 0, and u(x,0) = x on [0,1], and u(x,0) = 2 - x on [1,2].

This is a triangular function whose derivatives at x=1 do not exist. The coefficients $c_n$ for this problem are listed in Table IV. All twenty-one solution functions are needed in this solution because of the discontinuity in slope at the peak of the triangle. The error given in Table IV is the difference between the initial condition u(1,0) = 1 and the approximate solution at x=1.
Table IV. Coefficients c_n for Example 2.

  n      c_n        error at x=1
  1     0.80734     0.19
  2    -0.08126     0.18
  3    -0.08203     0.099
  4     0.00379     0.099
  5     0.03237     0.067
  6    -0.00141     0.067
  7    -0.01641     0.050
  8     0.00050     0.050
  9     0.00999     0.040
 10    -0.00027     0.040
 11    -0.00668     0.034
 12     0.00015     0.034
 13     0.00479     0.029
 14    -0.00010     0.029
 15    -0.00360     0.025
 16     0.00006     0.025
 17     0.00280     0.022
 18    -0.00005     0.022
 19    -0.00224     0.020
 20     0.00003     0.020
 21     0.00209     0.018

EXAMPLE 3: $u_{xx} - x^2 \cdot u = u_t$, satisfying u(0,t) = u(2,t) = 0, and u(x,0) = 1.

This is the problem stated at the beginning of this discussion; the initial and boundary conditions are inconsistent at x=0 and at x=2. The coefficients $c_n$ calculated for this problem are listed in Table V. All twenty-one of the solution functions are used in this solution; higher-order functions would have improved the accuracy. However, it should be pointed out that since the $\phi_n$-functions are orthogonal and normalized, the solutions as listed in Table V are complete up to the order given. Furthermore, since the higher-order functions have faster exponential decay constants than the low-order functions, the errors given in Table V quickly disappear for small values of time. The error given in Table V is the difference between the initial condition u(x,0) = 1 and the approximate solution at x = 0.05.

We have chosen to tabulate the error at x = 0.05, because this point is only 2.5% of the entire range of the problem. It is true that there is much error at this point with even 21 $\phi_n$-functions. However, this error exists only at t=0.
Table V. Coefficients c_n for the Problem of Eq.(1).

  n      c_n        error at x=0.05
  1     1.26588     0.88
  2    -0.09146     0.90
  3     0.43438     0.80
  4    -0.01233     0.80
  5     0.25678     0.70
  6    -0.00362     0.70
  7     0.18266     0.61
  8    -0.00152     0.61
  9     0.14183     0.52
 10    -0.00078     0.52
 11     0.11594     0.43
 12    -0.00045     0.43
 13     0.09806     0.34
 14    -0.00028     0.34
 15     0.08496     0.27
 16    -0.00019     0.27
 17     0.07495     0.19
 18    -0.00013     0.19
 19     0.06706     0.13
 20    -0.00042     0.13
 21     0.06151     0.066

4. CONCLUSIONS. We have developed a Fourier-type solution of parabolic partial differential equations using two-dimensional Taylor series. The attraction of this method is in its simplicity and the fact that the solution functions can be orthonormalized. Thus, it is possible to ignore the high-order functions if one is not overly concerned with the accuracy at t=0.

The solution time on a CDC-6400 computer is divided approximately as follows; 45 seconds for the calculation of the constants "a" given in Table II (up to 15 Newton iterations were required for convergence), 4 seconds to normalize the $\phi_n$-functions, and 3 seconds to calculate the $c_n$ coefficients,
ABSTRACT

This paper illustrates the application of a "Sinc-Galerkin" method to the approximate solution of linear and nonlinear second order ordinary differential equations, and to the approximate solution of some linear elliptic and parabolic partial differential equations in the plane. The method is based on approximating functions and their derivatives by use of the Whittaker cardinal function. The DE is reduced to a system of algebraic equations via new accurate explicit approximations of the inner products, the evaluation of which does not require any numerical integration. Using n function evaluations the error in the final approximation to the solution of the DE is $O(e^{-c\,n^{1/(2d)}})$ where c is independent of n, and d denotes the dimension of the region on which the DE is defined.
1. Introduction and Summary

The function sinc(x) is defined on the real line by

$$\mathrm{sinc}(x) = \frac{\sin(\pi x)}{\pi x} . \tag{1.1}$$

The Whittaker cardinal function of an arbitrary function f is defined for any h > 0 by

$$C(f,h,x) = \sum_{k=-\infty}^{\infty} f(kh)\, \mathrm{sinc}\!\left[\frac{x-kh}{h}\right] , \qquad h > 0, \tag{1.2}$$

whenever this series converges.
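
A truncated form of (1.2) is easy to experiment with. The sketch below is illustrative only: the function names, the symmetric truncation to 2N+1 terms, and the Gaussian test function are assumptions, not part of the paper.

```python
import math

def sinc(x):
    """sinc(x) = sin(pi x)/(pi x), with the removable singularity filled in."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)

def cardinal(f, h, x, N):
    """Truncated Whittaker cardinal series: sum over k = -N..N of f(kh) sinc((x-kh)/h)."""
    return sum(f(k * h) * sinc((x - k * h) / h) for k in range(-N, N + 1))

def gauss(t):
    return math.exp(-t * t)

between_nodes = cardinal(gauss, 0.25, 0.3, 40)   # x lies between sample points
at_node = cardinal(gauss, 0.25, 0.5, 40)         # x = 2h is a sample point
```

At a sample point the series reproduces the sampled value, and for a rapidly decaying analytic function such as the Gaussian the error between sample points is already very small for moderate h.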

The approximation of f using a finite number of terms of (1.2) has been extensively studied. The paper [7] contains a review of the properties of C(f,h,x) which were discovered by E.T. Whittaker [14], J.M. Whittaker [15], Hartley [4], Nyquist [8] and Shannon [11]. In [12] new approximations are derived by means of C(f,h,x), for interpolating, integrating and approximating the Fourier (over $(-\infty,\infty)$ only) and Hilbert transforms over $(-\infty,\infty)$, $(0,\infty)$ and $(-1,1)$. In [6] the function C(f,h,x) is used to obtain formulas for approximating the derivatives of functions over $(-\infty,\infty)$, $(0,\infty)$ and $(-1,1)$.

In the present paper we use results of [6,12] to derive basis functions for Galerkin schemes of solving second order problems,

and  we  derive  explicit  and  highly  accurate  expressions  for  inner 

7 

products  such  as  (f  vy  , • (f  , b^)  > (fu>  . All  of 

dxi’ 

these  are  expressed  in  terms  of  the  function  values  of  u,  and  not 
the  derivatives  of  u . We  then  study  the  application  of  the  derived 
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approximations  on  the  approximate  solution  of  some  ordinary  and  partial 
differential  equations,  via  the  Galerkin  method.  The  combined  method 
thus  yields  a system  of  algebraic  equations,  without  the  use  of  any 
quadrature . 

Let us briefly compare the present method of approximate solution
of linear differential equations with currently popular finite difference
methods, or with finite element methods that use piecewise linear
elements. The finite difference or finite element methods lead to a
sparse system of equations. The use of n solution evaluations
usually leads to a linear system of n algebraic equations having a
non-singular coefficient matrix. The error in the resulting approximate
solution is $O(n^{-p})$, where p is usually 1 or 2. The "Sinc-Galerkin"
method of the present paper also leads to a system of order
n on the basis of n solution evaluations. This system has a
nonsingular full matrix. The error in the resulting approximate
solution is $O(e^{-cn^{1/(2d)}})$, where d denotes
the dimension of the problem. The advantage of the present method is
that due to its rapid convergence it does not require the solution of
a very large system of equations in order to achieve a desired accuracy,
if more than two significant figures of accuracy are required in the
approximate solution. In addition, the rate of convergence of the
present method is the same, regardless of possible singularities of
the solution of an equation on the boundary of the region.
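
The two error models quoted above can be compared numerically. The following is only an illustrative sketch: the constant c = 1, the default p = 2 and the function names are assumptions made for the illustration, not values taken from the paper.

```python
import math

def algebraic_error(n, p=2):
    """Model error n**(-p) of a finite difference or piecewise linear element scheme."""
    return n ** (-p)

def sinc_galerkin_error(n, c=1.0, d=1):
    """Model error exp(-c * n**(1/(2d))); c = 1 is an assumed constant, d the dimension."""
    return math.exp(-c * n ** (1.0 / (2 * d)))

if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(n, algebraic_error(n), sinc_galerkin_error(n))
```

With these assumed constants the exponential model is the larger of the two for small n and becomes much smaller once n is moderately large, which is the behaviour the comparison above describes.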

The approximate methods of [6,12] have previously been effectively
applied to the approximate solution of integral equations via Galerkin-type
methods in [2,9,10]. In [5] an effective Galerkin-type method
is derived which uses approximations derived in [12] to obtain an
approximate solution to the problem

(1.3)  $y'' = y - y^3/x^2, \quad y(0) = y(\infty) = 0$

via the minimization of a certain nonlinear functional. In all of the
above cases the error of an n-point approximate solution is $O(e^{-cn^{1/2}})$.

In Sec. 2 of the present paper we review the relevant known
approximation properties of Whittaker's cardinal function, and we then
use these to derive explicit approximate inner products, in general as
well as for the important special cases of the intervals [0,1], [-1,1],
$[0,\infty]$ and $[-\infty,\infty]$. In Sec. 3 we illustrate the application of the
previously derived formulas to the approximate solution of some simple
"model" problems, such as $u'' = -2$, $u'' = u - u^3/x^2$, $u_t = u_{xx}$ and
$u_{xx} + u_{yy} = f$, with appropriate boundary conditions. In Sec. 4 we
carry out an error analysis, proving the $O(e^{-cn^{1/(2d)}})$ rate of
convergence referred to above.

2.  Preliminaries and Fundamentals

In this section we shall recall some known properties [18] and
derive some new properties of Whittaker's cardinal function, which we
shall require in this paper.

Definition 2.1. Let R denote the real line, C the complex plane, and
let B(h) denote the family of all functions f defined on C that are
entire, such that $f \in L^2(R)$ and such that

(2.1)  $|f(z)| \le C e^{\pi|y|/h}, \quad z = x + iy \in C,$

for some constant C. Set

(2.2)  $S(j,h)(x) = \mathrm{sinc}\left[\frac{x - jh}{h}\right]$

and

(2.3)  $\delta^{(n)}_{jk} = h^n \left(\frac{d}{dx}\right)^n S(j,h)(x)\Big|_{x = kh}.$

In particular, we have

(2.4)  $\delta^{(0)}_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases}, \qquad \delta^{(1)}_{jk} = \begin{cases} 0 & \text{if } j = k \\ \dfrac{(-1)^{k-j}}{k-j} & \text{if } j \ne k \end{cases}, \qquad \delta^{(2)}_{jk} = \begin{cases} -\dfrac{\pi^2}{3} & \text{if } j = k \\ \dfrac{-2(-1)^{k-j}}{(k-j)^2} & \text{if } j \ne k. \end{cases}$

Theorem 2.2 [7]: Let $f \in B(h)$. Then f(z) = C(f,h,z). Moreover

(2.5)  $f(z) = \int_{-\pi/h}^{\pi/h} g(t)\,e^{izt}\,dt$ for some $g \in L^2(-\pi/h, \pi/h)$;

(2.6)  $f(z) = \frac{1}{h}\int_{-\infty}^{\infty} \mathrm{sinc}\left[\frac{z - t}{h}\right] f(t)\,dt$;

(2.7)  $\int_R |f(x)|^2\,dx = h \sum_{k=-\infty}^{\infty} |f(kh)|^2,$

and the sequence $\{h^{-1/2} S(k,h)\}_{k=-\infty}^{\infty}$ is therefore a complete orthonormal
sequence in B(h);

(2.8)  $f \in B(h) \Rightarrow f' \in B(h).$


Theorem 2.3: Let $\delta^{(n)}_{jk}$ be defined as in (2.3). Then

(2.9)  $\int_R \left(\frac{d}{dx}\right)^n \mathrm{sinc}\left[\frac{x - jh}{h}\right]\,\mathrm{sinc}\left[\frac{x - kh}{h}\right] dx = h^{1-n}\,\delta^{(n)}_{jk}, \quad n = 0,1,2,\ldots$

Proof: Let us set

(2.10)  $f(t) = S(j,h)(t)$

and let us note that $f \in B(h)$. By Eq. (2.8) it thus follows that
$f^{(n)} \in B(h)$, n = 0,1,2,... . Eq. (2.9) thus follows by taking
$f = S^{(n)}(j,h)$ in (2.6), and noting by (2.3) that

(2.11)  $S^{(n)}(j,h)(kh) = h^{-n}\,\delta^{(n)}_{jk}.$
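
The closed forms in (2.4) can be checked against the definition (2.3), equivalently (2.11), by differencing S(j,h) at the grid points. The sketch below is only a numerical sanity check; the helper names `delta`, `S` and `delta_fd`, the step `eps` and the value h = 0.7 are assumptions made for the illustration.

```python
import numpy as np

def delta(n, j, k):
    """Closed form of delta^{(n)}_{jk} from (2.4), for n = 0, 1, 2 and integers j, k."""
    m = k - j
    if n == 0:
        return 1.0 if m == 0 else 0.0
    if n == 1:
        return 0.0 if m == 0 else (-1.0) ** m / m
    if n == 2:
        return -np.pi ** 2 / 3 if m == 0 else -2.0 * (-1.0) ** m / m ** 2
    raise ValueError("only n = 0, 1, 2 are tabulated in (2.4)")

def S(j, h, x):
    """S(j,h)(x) = sinc((x - j h)/h); np.sinc(t) is sin(pi t)/(pi t)."""
    return np.sinc((x - j * h) / h)

def delta_fd(n, j, k, h=0.7, eps=1e-3):
    """h**n times a central-difference estimate of the n-th derivative of S(j,h) at x = k h."""
    x = k * h
    if n == 0:
        return S(j, h, x)
    if n == 1:
        return h * (S(j, h, x + eps) - S(j, h, x - eps)) / (2 * eps)
    return h ** 2 * (S(j, h, x + eps) - 2 * S(j, h, x) + S(j, h, x - eps)) / eps ** 2
```

The agreement is limited only by the finite-difference step, so a tolerance of about 1e-3 is appropriate when comparing `delta` with `delta_fd`.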




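Definition 2.4: Let d > 0, and let $B(\mathcal{D}'_d)$ denote the family of all
functions f that are analytic in

(2.12)  $\mathcal{D}'_d = \{z = x + iy : |y| < d\},$

such that

(2.13)  $\int_{-d}^{d} |f(x+iy)|\,dy \to 0$ as $x \to \pm\infty$

and such that $N(f,\mathcal{D}'_d) < \infty$, where

(2.14)  $N(f,\mathcal{D}'_d) = \lim_{y \to d^-} \left\{ \int_R |f(x+iy)|\,dx + \int_R |f(x-iy)|\,dx \right\}.$

Theorem 2.5 [12]: Let $f \in B(\mathcal{D}'_d)$, and let $\varepsilon(f)$ be defined by

(2.15)  $\varepsilon(f)(x) = f(x) - C(f,h,x), \quad x \in R.$

Then

(2.16)  $\varepsilon(f)(x) = \frac{\sin(\pi x/h)}{2\pi i} \int_R \left\{ \frac{f(t - id^-)}{(t-x-id)\sin[\pi(t-id)/h]} - \frac{f(t + id^-)}{(t-x+id)\sin[\pi(t+id)/h]} \right\} dt.$

Moreover

(2.17)  $\sup_{x \in R} |\varepsilon(f)(x)| \le \frac{N(f,\mathcal{D}'_d)}{2\pi d\,\sinh(\pi d/h)}.$

Definition 2.6: Let $\mathcal{D}$ be a simply connected domain in the complex
plane C, and let $\phi$ be a conformal map of $\mathcal{D}$ onto $\mathcal{D}'_d$. Let
$\psi = \phi^{-1}$ denote the inverse map, let $a = \psi(-\infty)$ and $b = \psi(\infty)$, and let

(2.18)  $\Gamma = \{w \in \mathcal{D} : w = \psi(x),\ -\infty \le x \le \infty\}.$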

Let $B(\mathcal{D})$ denote the family of all functions that are analytic in $\mathcal{D}$,
such that for u real

(2.19)  $\int_{\psi(L+u)} |f(z)\,dz| \to 0$ as $u \to \pm\infty$,

where

(2.20)  $L = \{iy : -d \le y \le d\},$

and such that

(2.21)  $N(f,\mathcal{D}) = \liminf_{C \to \partial\mathcal{D},\ C \subset \mathcal{D}} \int_C |f(z)\,dz| < \infty.$

(Note that if $f \in B(\mathcal{D})$, then $f \circ \psi \in B(\mathcal{D}'_d)$.) Set

(2.22)  $x_k = \psi(kh), \quad k = 0, \pm 1, \pm 2, \ldots,$

and let g be a function which is analytic in $\mathcal{D}$, whose properties we
shall determine in the sequel. Finally, we set

(2.23)  $S_j(z) = g(z)\,\mathrm{sinc}\left[\frac{\phi(z) - jh}{h}\right] = g(z)\,S(j,h)\circ\phi(z).$

The  following  result  was  established  in  [6]. 


Theorem 2.7: Let m be a nonnegative integer, and let $f\phi'/g \in B(\mathcal{D})$.
Let there exist positive constants $\alpha > 0$, $C_0$ depending only on m, d
and g, $C_1$ depending only on m and g, and $C_2$ depending only on
m, g and f, such that

(2.24)  $\left|\frac{f(x)}{g(x)}\right| \le C_2\,e^{-\alpha|\phi(x)|}$ for all $x \in \Gamma$,

(2.25)  $\left|\left(\frac{d}{dx}\right)^n S_k(x)\right| \le C_1 h^{-n}$ for all $x \in \Gamma$,

(2.26)  $\left|\left(\frac{d}{dx}\right)^n \left\{\frac{g(x)\sin[\pi\phi(x)/h]}{\phi(z) - \phi(x)}\right\}\right| \le C_0 h^{-n}$ for all $x \in \Gamma$, $z \in \partial\mathcal{D}$,

n = 0,1,...,m.

Then there exists a constant K depending only on m, d, $\alpha$, g and f such
that if $h = [\pi d/(\alpha N)]^{1/2}$ then

(2.27)  $\left| f^{(n)}(x) - \sum_{j=-N}^{N} \frac{f(x_j)}{g(x_j)}\,S_j^{(n)}(x) \right| \le K N^{(n+1)/2} \exp[-(\pi d\alpha N)^{1/2}]$

for all $x \in \Gamma$, and for n = 0,1,...,m.


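Theorem 2.8 [12]: If $f \in B(\mathcal{D})$, then the identity

(2.28)  $\frac{f(x)}{\phi'(x)} - \sum_{j=-\infty}^{\infty} \frac{f(x_j)}{\phi'(x_j)}\,S(j,h)\circ\phi(x) = \frac{\sin[\pi\phi(x)/h]}{2\pi i} \int_{\partial\mathcal{D}} \frac{f(z)\,dz}{[\phi(z) - \phi(x)]\sin[\pi\phi(z)/h]}$

is valid for all $x \in \Gamma$. Moreover

(2.29)  $\int_\Gamma f(x)\,dx - h \sum_{j=-\infty}^{\infty} \frac{f(x_j)}{\phi'(x_j)} = \frac{i}{2} \int_{\partial\mathcal{D}} \frac{\exp[\frac{i\pi}{h}\phi(z)\,\mathrm{sgn}\,\mathrm{Im}\,\phi(z)]}{\sin[\pi\phi(z)/h]}\,f(z)\,dz.$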
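
The left-hand side of (2.29) is a quadrature rule, $\int_\Gamma f \approx h\sum_j f(x_j)/\phi'(x_j)$. The following is a minimal sketch of that rule for $\Gamma = [0,1]$, truncated to $|j| \le N$. The map $\phi(x) = \log[x/(1-x)]$ is the one the paper uses later for this interval; the function name, the truncation, the test integrand $x^{-1/2}$ and the step $h = \pi/\sqrt{N}$ (that is, $h = [\pi d/(\alpha N)]^{1/2}$ with the assumed values d = π/2, α = 1/2) are choices made for the illustration.

```python
import numpy as np

def sinc_quadrature_01(f, N, h):
    """Approximate the integral of f over (0,1) by h * sum_{|j|<=N} f(x_j)/phi'(x_j),
    with phi(x) = log(x/(1-x)), x_j = psi(j h) and 1/phi'(x) = x(1-x)."""
    w = h * np.arange(-N, N + 1)
    x = 1.0 / (1.0 + np.exp(-w))            # x_j = psi(j h)
    one_minus_x = 1.0 / (1.0 + np.exp(w))   # 1 - x_j, computed directly to avoid cancellation
    return h * np.sum(f(x) * x * one_minus_x)

if __name__ == "__main__":
    N = 50
    h = np.pi / np.sqrt(N)
    # Integrand with an endpoint singularity; the exact integral is 2.
    print(sinc_quadrature_01(lambda x: x ** -0.5, N, h))
```

The integrand is singular at x = 0, yet no special treatment of the endpoint is needed, because the nodes cluster there and the weights $x_j(1-x_j)$ vanish.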


The results of this theorem may be conveniently combined with those of
the formulas obtained above, to yield explicit approximate expressions
for inner products. The results of the following lemma are useful for
bounding the error of these approximate expressions.


Lemma 2.9: If |Im z| = d > 0 and if k is an integer, then

(2.30)  $\left|\frac{\mathrm{sinc}[(z-kh)/h]}{\sin(\pi z/h)}\right| \le C_1(h,d) \equiv \frac{h}{\pi d},$

(2.31)  $\left|\frac{(d/dz)\{\mathrm{sinc}[(z-kh)/h]\}}{\sin(\pi z/h)}\right| \le C_2(h,d) \equiv \frac{d + (h/\pi)\tanh(\pi d/h)}{d^2\tanh(\pi d/h)},$

(2.32)  $\left|\frac{(d/dz)^2\{\mathrm{sinc}[(z-kh)/h]\}}{\sin(\pi z/h)}\right| \le C_3(h,d) \equiv \frac{[(2h/\pi) + \pi d^2/h]\tanh(\pi d/h) + 2d}{d^3\tanh(\pi d/h)},$

(2.33)  $|\sin(\pi z/h)| \ge \sinh(\pi d/h), \quad |\cos(\pi z/h)| \le \cosh(\pi d/h).$

Proof: We shall only prove (2.31), since the proofs of the remaining
cases are similar, and we omit them. We have

$w \equiv \frac{d}{dz}\{S(k,h)(z)\} = \frac{\cos[\pi(z-kh)/h]}{z-kh} - \frac{h}{\pi}\,\frac{\sin[\pi(z-kh)/h]}{(z-kh)^2}.$

Now if |Im z| = d, then $|z-kh| \ge d$, $|\cos[\pi(z-kh)/h]| \le \cosh(\pi d/h)$ and
$|\sin[\pi(z-kh)/h]| = |\sin(\pi z/h)| \ge \sinh(\pi d/h)$; hence

$\left|\frac{w}{\sin(\pi z/h)}\right| \le \frac{1}{d\tanh(\pi d/h)} + \frac{h}{\pi d^2} = C_2(h,d).$
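
The bounds (2.30) and (2.31) can be spot-checked by sampling the two lines Im z = ±d. The sketch below is only a numerical illustration; the function names and the sample values h = 0.5, d = 1 are assumptions, and the constants are coded in the forms $C_1 = h/(\pi d)$ and $C_2 = [d + (h/\pi)\tanh(\pi d/h)]/[d^2\tanh(\pi d/h)]$ that follow from the proof above.

```python
import numpy as np

def sinc_and_derivative(z, k, h):
    """Return S(k,h)(z) and its z-derivative for complex z with z != k h."""
    a = np.pi * (z - k * h) / h
    s = np.sin(a) / a
    ds = np.cos(a) / (z - k * h) - (h / np.pi) * np.sin(a) / (z - k * h) ** 2
    return s, ds

def C1(h, d):
    """Bound (2.30) on |S(k,h)(z) / sin(pi z/h)| for |Im z| = d."""
    return h / (np.pi * d)

def C2(h, d):
    """Bound (2.31) on |S'(k,h)(z) / sin(pi z/h)| for |Im z| = d."""
    t = np.tanh(np.pi * d / h)
    return (d + (h / np.pi) * t) / (d ** 2 * t)

if __name__ == "__main__":
    h, d = 0.5, 1.0
    z = np.linspace(-5.0, 5.0, 2001) + 1j * d
    s, ds = sinc_and_derivative(z, 0, h)
    denom = np.abs(np.sin(np.pi * z / h))
    print(np.max(np.abs(s) / denom), C1(h, d))
    print(np.max(np.abs(ds) / denom), C2(h, d))
```

The first bound is attained at z = kh ± id, so the sampled maximum of the first ratio coincides with C1 up to rounding.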


Theorem 2.10: Let $\delta^{(n)}_{jk}$ be defined as in (2.3) and (2.4), let $C_j(h,d)$
be defined as in Lemma 2.9, let $x_k$ be defined as in (2.22), $S_k$ as in
(2.23), set $F_k = F(x_k)$ for an arbitrary function F, and let r and
f be functions which are analytic in $\mathcal{D}$.

(a) Let $rfg \in B(\mathcal{D})$. Then

(2.34)  $\left| \int_\Gamma r(x) f(x) S_k(x)\,dx - h\,\frac{r_k f_k g_k}{\phi'_k} \right| \le C_1(h,d)\,N(rfg,\mathcal{D})\,e^{-\pi d/h}.$

(b) Let $[rfg/\phi](x) \to 0$ as $x \to a$ and as $x \to b$ along $\Gamma$, and let
$(rg)'f$ and $rg\phi'f \in B(\mathcal{D})$. Then

(2.35)  $\left| \int_\Gamma r(x) f'(x) S_k(x)\,dx + h \sum_{j=-\infty}^{\infty} f_j \left\{ \frac{(rg)'_j}{\phi'_j}\,\delta^{(0)}_{kj} + (rg)_j\,\frac{\delta^{(1)}_{kj}}{h} \right\} \right| \le [C_1(h,d)\,N(f(rg)',\mathcal{D}) + C_2(h,d)\,N(frg\phi',\mathcal{D})]\,e^{-\pi d/h}.$

(c) Let $[f(rg)'/\phi](x)$, $[frg\phi'/\phi](x)$ and $[f'rg/\phi](x) \to 0$ as $x \to a$
and as $x \to b$ along $\Gamma$, and let $f(rg)''$, $f[2(rg)'\phi' + rg\phi'']$ and
$frg(\phi')^2 \in B(\mathcal{D})$. Then

(2.36)  $\left| \int_\Gamma r(x) f''(x) S_k(x)\,dx - h \sum_{j=-\infty}^{\infty} f_j \left\{ \frac{(rg)''_j}{\phi'_j}\,\delta^{(0)}_{kj} + \left[2(rg)'_j + (rg)_j\frac{\phi''_j}{\phi'_j}\right] \frac{\delta^{(1)}_{kj}}{h} + (rg)_j\,\phi'_j\,\frac{\delta^{(2)}_{kj}}{h^2} \right\} \right|$
$\le [C_1(h,d)\,N(f(rg)'',\mathcal{D}) + C_2(h,d)\,N(f\{2(rg)'\phi' + rg\phi''\},\mathcal{D}) + C_3(h,d)\,N(frg(\phi')^2,\mathcal{D})]\,e^{-\pi d/h}.$


Proof: We shall only prove the (b)-part of Theorem 2.10, since the proofs
of the (a) and (c)-parts are similar.

We find, upon integration by parts, that

(2.37)  $\int_\Gamma r(x) f'(x) S_k(x)\,dx = r(x) f(x) S_k(x)\Big|_a^b - \int_\Gamma f(x)\,[r(x) S'_k(x) + r'(x) S_k(x)]\,dx.$

The first term on the right-hand side vanishes, by assumption of the
(b)-part of the theorem, while by expansion of the second part of (2.37),
we have

(2.38)  $\int_\Gamma r(x) f'(x) S_k(x)\,dx = -\int_\Gamma f(x)\,[(rg)'(x)\,S(k,h)\circ\phi(x) + (rg\phi')(x)\,S'(k,h)\circ\phi(x)]\,dx.$

Hence by replacing f in (2.29) by the integrand on the right-hand side
of (2.38), and noting that if $z \in \partial\mathcal{D}$, then $|\mathrm{Im}\,\phi(z)| = d$ and
$|\exp[\frac{i\pi}{h}\phi(z)\,\mathrm{sgn}\,\mathrm{Im}\,\phi(z)]| = e^{-\pi d/h}$, we find by (2.29), Lemma 2.9 and
Theorem 2.3, that

$\left| \int_\Gamma r(x) f'(x) S_k(x)\,dx + h \sum_{j=-\infty}^{\infty} f_j \left\{ \frac{(rg)'_j}{\phi'_j}\,\delta^{(0)}_{kj} + (rg)_j\,\frac{\delta^{(1)}_{kj}}{h} \right\} \right|$
$\le e^{-\pi d/h} \int_{\partial\mathcal{D}} [C_1(h,d)\,|[f(rg)'](z)| + C_2(h,d)\,|(frg\phi')(z)|]\,|dz|,$

which is just (2.35).


Theorem 2.11: Let N be a positive integer, $\alpha$ a positive constant,
and take $h = [\pi d/(\alpha N)]^{1/2}$.

(a) Under the assumptions of Theorem 2.10 (a),

(2.39)  $\left| \int_\Gamma r(x) f(x) S_k(x)\,dx - h\,\frac{r_k f_k g_k}{\phi'_k} \right| \le K_1 N^{-1/2}\,e^{-(\pi d\alpha N)^{1/2}},$

where $K_1$ depends only on f, r, g, d and $\alpha$;

(b) If $|[rgf](x)| \le \exp[-\alpha|\phi(x)|]$ on $\Gamma$, then under the assumptions
of Theorem 2.10 (b),

(2.40)  $\left| \int_\Gamma r(x) f'(x) S_k(x)\,dx + h \sum_{j=-N}^{N} f_j \left\{ \frac{(rg)'_j}{\phi'_j}\,\delta^{(0)}_{kj} + (rg)_j\,\frac{\delta^{(1)}_{kj}}{h} \right\} \right| \le K_2\,e^{-(\pi d\alpha N)^{1/2}},$

k = -N, -N+1, ..., N, where $K_2$ depends only on f, r, g, d and $\alpha$;

(c) If $[f\{2(rg)' + rg\phi''/\phi'\}](x)$ and $[rgf\phi'](x)$ are bounded by
$\exp[-\alpha|\phi(x)|]$ on $\Gamma$, then under the assumptions of Theorem 2.10 (c),

(2.41)  $\left| \int_\Gamma r(x) f''(x) S_k(x)\,dx - h \sum_{j=-N}^{N} f_j \left\{ \frac{(rg)''_j}{\phi'_j}\,\delta^{(0)}_{kj} + \left[2(rg)'_j + (rg)_j\frac{\phi''_j}{\phi'_j}\right] \frac{\delta^{(1)}_{kj}}{h} + (rg)_j\,\phi'_j\,\frac{\delta^{(2)}_{kj}}{h^2} \right\} \right| \le K_3 N^{1/2}\,e^{-(\pi d\alpha N)^{1/2}},$

k = -N, -N+1, ..., N, where $K_3$ depends only on f, r, g, d and $\alpha$.

Proof: The proof is similar to that of Theorem 8.1 of [12], and we omit
it.

The results of Theorem 2.11 are especially suited to the solution of
linear differential equations via a Galerkin method, for which the
functions $\{S_k\}$ are the approximating basis functions. We remark that
we could have obtained alternate expressions of
$\int_\Gamma r(x) f^{(n)}(x) S_k(x)\,dx$
by combining equations (2.29) and (2.27), i.e., if $rgf^{(n)} \in B(\mathcal{D})$, then
by Eq. (2.29)

(2.42)  $\left| \int_\Gamma r(x) f^{(n)}(x) S_k(x)\,dx - h\,\frac{r_k g_k f^{(n)}(x_k)}{\phi'_k} \right| \le C_1(h,d)\,N(rgf^{(n)},\mathcal{D})\,e^{-\pi d/h}$

and we could now use (2.27) to approximate $f^{(n)}(x_k)$ on $\Gamma$. However,
the resulting expressions are not as accurate as those of Theorem 2.11.
Nevertheless the pair of equations (2.27) and (2.29) do form a powerful
combination for purposes of solving nonlinear equations. For example,
if $G \in B(\mathcal{D})$, where G = G(x, f(x), f'(x)), then

(2.43)  $\left| \int_\Gamma G(x, f(x), f'(x))\,S_k(x)\,dx - h\,\frac{G(x_k, f(x_k), f'(x_k))}{\phi'(x_k)}\,g(x_k) \right| \le C_1(h,d)\,N(G,\mathcal{D})\,e^{-\pi d/h}.$

If the conditions of Theorem 2.7 are satisfied for m = 1, we may now
replace $f'(x_k)$ in (2.43) by the approximation

(2.44)  $f'(x_k) \approx \sum_{j=-N}^{N} \frac{f(x_j)}{g(x_j)}\,S_j^{(1)}(x_k)$

given by Eq. (2.27).

The approximating expressions of Theorem 2.11 may be more compactly
expressed by means of matrices. To this end, let m = 2N + 1, and let
$S_m$ and $f_m$ be column vectors defined by

(2.45)  $S_m(x) = [S_{-N}(x),\ S_{-N+1}(x),\ \ldots,\ S_N(x)]^T, \qquad f_m = [f_{-N},\ f_{-N+1},\ \ldots,\ f_N]^T.$

Corresponding to a function u = u(x), let $A_m(u)$ denote a diagonal
matrix, whose diagonal elements are $u(x_{-N}), u(x_{-N+1}), \ldots, u(x_N)$ and
whose off-diagonal elements are zero. Let $I^{(1)}_m$ and $I^{(2)}_m$ denote the
matrices

(2.46)  $I^{(1)}_m = [\delta^{(1)}_{kj}] = \begin{bmatrix} 0 & -1 & \frac{1}{2} & \cdots & \frac{1}{2N} \\ 1 & 0 & -1 & \cdots & -\frac{1}{2N-1} \\ -\frac{1}{2} & 1 & 0 & \cdots & \frac{1}{2N-2} \\ \vdots & & & \ddots & \vdots \\ -\frac{1}{2N} & \frac{1}{2N-1} & -\frac{1}{2N-2} & \cdots & 0 \end{bmatrix}$

(2.47)  $I^{(2)}_m = [\delta^{(2)}_{kj}] = \begin{bmatrix} -\frac{\pi^2}{3} & 2 & -\frac{2}{2^2} & \cdots & -\frac{2}{(2N)^2} \\ 2 & -\frac{\pi^2}{3} & 2 & \cdots & \frac{2}{(2N-1)^2} \\ \vdots & & \ddots & & \vdots \\ -\frac{2}{(2N)^2} & \frac{2}{(2N-1)^2} & -\frac{2}{(2N-2)^2} & \cdots & -\frac{\pi^2}{3} \end{bmatrix}$

With this notation, Eqns. (2.39), (2.40) and (2.41) take the
approximating form

(2.48)
$\int_\Gamma r(x) f(x) S_m(x)\,dx \approx h\,A_m\!\left(\frac{rg}{\phi'}\right) f_m$

$\int_\Gamma r(x) f'(x) S_m(x)\,dx \approx -h\left[A_m\!\left(\frac{(rg)'}{\phi'}\right) + \frac{1}{h}\,I^{(1)}_m A_m(rg)\right] f_m$

$\int_\Gamma r(x) f''(x) S_m(x)\,dx \approx h\left[A_m\!\left(\frac{(rg)''}{\phi'}\right) + \frac{1}{h}\,I^{(1)}_m A_m\!\left(2(rg)' + rg\phi''/\phi'\right) + \frac{1}{h^2}\,I^{(2)}_m A_m(rg\phi')\right] f_m$

Contrary to the case of matrices that arise in the solution of
differential equations by finite difference or finite element methods,
relatively little is known about the matrices $I^{(1)}_m$ and $I^{(2)}_m$ in
(2.46) and (2.47). To this end, we record two simple observations.

Theorem 2.12. The matrix $I^{(1)}_m$ in (2.46) is a skew symmetric matrix
of odd order, m = 2N + 1 and it is therefore singular. The matrix
$I^{(2)}_m$ is a non-singular symmetric matrix of order m = 2N + 1.

Proof: (due to A. Adler at U.B.C.) Expanding $\det I^{(2)}_m$, we have
$(-3)^m \det I^{(2)}_m = \pi^{4N+2} + c_1\pi^{4N} + c_2\pi^{4N-2} + \cdots + c_{2N+1}$, where the $c_j$ are
rational. The number $\pi^2$ is a transcendental number, whereas the
vanishing of $\det I^{(2)}_m$ would imply that $\pi^2$ is an algebraic number.
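
The two observations of Theorem 2.12 are easy to examine numerically for small N. The sketch below builds the matrices with entries $\delta^{(1)}_{kj}$ and $\delta^{(2)}_{kj}$ from (2.4); the function name `sinc_matrices` and the row/column convention (row index k, column index j) are assumptions made for the illustration.

```python
import numpy as np

def sinc_matrices(N):
    """Return I1 = [delta^{(1)}_{kj}] and I2 = [delta^{(2)}_{kj}] of order m = 2N + 1."""
    idx = np.arange(-N, N + 1)
    diff = idx[None, :] - idx[:, None]          # diff[k, j] = j - k
    safe = np.where(diff == 0, 1, diff)         # avoid division by zero on the diagonal
    sign = np.where(diff % 2 == 0, 1.0, -1.0)   # (-1)**(j - k)
    I1 = np.where(diff == 0, 0.0, sign / safe)
    I2 = np.where(diff == 0, -np.pi ** 2 / 3, -2.0 * sign / safe ** 2)
    return I1, I2

if __name__ == "__main__":
    I1, I2 = sinc_matrices(3)
    print(np.allclose(I1, -I1.T), np.linalg.det(I1))   # skew symmetric, determinant ~ 0
    print(np.allclose(I2, I2.T), np.linalg.eigvalsh(I2))
```

For the small orders tried here the eigenvalues of the second matrix all come out negative, which is consistent with its non-singularity.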

We close this section with a derivation of the formulas of Theorem
2.11 for the case of the important intervals [0,1], [-1,1], $[0,\infty]$,
and $[-\infty,\infty]$.


Ex. 1: $\Gamma = [0,1]$. In this case

(2.49)  $\phi(z) = \log\left(\frac{z}{1-z}\right), \quad \phi'(z) = \frac{1}{z(1-z)}, \quad \mathcal{D} = \left\{z : \left|\arg\left(\frac{z}{1-z}\right)\right| < d\right\}.$

Let us assume that the coefficients r of a second order equation are
analytic and bounded in $\mathcal{D}$, and that the same is true of r' and r''.
It is then convenient to take

(2.50)  $g(x) = \frac{1}{\phi'(x)} = x(1-x).$

The conditions of Theorem 2.11 are satisfied if f is analytic and bounded
on $\mathcal{D}$, and if on [0,1], $|f(x)| \le C[x(1-x)]^{\alpha}$, where C and $\alpha$ are
positive constants. If f does not vanish at 0 and 1, we replace
f by F in the differential equation, where

(2.51)  $F(x) = f(x) - a(1-x) - bx$

and where a = f(0), b = f(1). The basis functions are

(2.52)  $\{S_k(x)\}_{k=-N}^{N} = \{x(1-x)\,S(k,h)\circ\phi(x)\}_{k=-N}^{N}.$

To these, it may be necessary to adjoin 1 - x if a is unknown, and
x if b is unknown. Differentiating g and $\phi'$, we get

(2.53)  $g'(x) = 1 - 2x, \quad g''(x) = -2, \quad \phi''(x) = \frac{2x - 1}{x^2(1-x)^2}.$

Hence

(2.54)
$(rg)(x) = x(1-x)\,r(x); \quad \left(\frac{rg}{\phi'}\right)(x) = x^2(1-x)^2\,r(x)$
$\left(\frac{(rg)'}{\phi'}\right)(x) = x(1-x)\,[x(1-x)\,r'(x) + (1-2x)\,r(x)]$
$\left(\frac{(rg)''}{\phi'}\right)(x) = x(1-x)\,[x(1-x)\,r''(x) + 2(1-2x)\,r'(x) - 2r(x)]$
$\left(2(rg)' + rg\phi''/\phi'\right)(x) = 2x(1-x)\,r'(x) + (1-2x)\,r(x)$
$(rg\phi')(x) = r(x)$

Hence we get the approximations (2.48), in which $x_k = \frac{1}{2} + \frac{1}{2}\tanh(kh/2)$.
Ex. 2. $\Gamma = [-1,1]$. In this case

(2.55)  $\phi(z) = \log\left(\frac{1+z}{1-z}\right), \quad \phi'(z) = \frac{2}{1-z^2}, \quad \mathcal{D} = \left\{z : \left|\arg\left(\frac{1+z}{1-z}\right)\right| < d\right\}.$

Under assumptions on r similar to those of Ex. 1, we take

(2.56)  $g(x) = \frac{1}{\phi'(x)} = \frac{1}{2}(1 - x^2).$

The conditions of Theorems 2.10 and 2.11 are satisfied if f is analytic
and bounded on $\mathcal{D}$, and if on (-1,1), $|f(x)| \le C(1-x^2)^{\alpha}$, where C and
$\alpha$ > 0. If f does not vanish at -1 and 1, we set f = F + p in
the differential equation, where

(2.57)  $p(x) = a\,\frac{1-x}{2} + b\,\frac{1+x}{2}$

and where a = f(-1), b = f(1). The basis functions are

(2.58)  $\{S_k(x)\}_{k=-N}^{N} = \{\tfrac{1}{2}(1-x^2)\,S(k,h)\circ\phi(x)\}_{k=-N}^{N};$

to these it may be necessary to adjoin (1-x)/2 and/or (1+x)/2 if a
and/or b are unknown. Differentiating g and $\phi'$, we get

(2.59)  $g'(x) = -x, \quad g''(x) = -1, \quad \phi''(x) = x\left(\frac{2}{1-x^2}\right)^2,$

so that

(2.60)
$(rg)(x) = \tfrac{1}{2}(1-x^2)\,r(x); \quad \left(\frac{rg}{\phi'}\right)(x) = \left(\frac{1-x^2}{2}\right)^2 r(x)$
$\left(\frac{(rg)'}{\phi'}\right)(x) = \left(\frac{1-x^2}{2}\right)^2 r'(x) - x\left(\frac{1-x^2}{2}\right) r(x)$
$\left(\frac{(rg)''}{\phi'}\right)(x) = \left(\frac{1-x^2}{2}\right)^2 r''(x) - x(1-x^2)\,r'(x) - \left(\frac{1-x^2}{2}\right) r(x)$
$\left(2(rg)' + rg\phi''/\phi'\right)(x) = (1-x^2)\,r'(x) - x\,r(x)$
$(rg\phi')(x) = r(x)$

Hence, with $x_k = \tanh(kh/2)$, the approximations of Theorem 2.11 take
the form (2.48).


Ex. 3: The case $\Gamma = [0,\infty]$. In this case

(2.61)  $\phi(z) = \log z, \quad \phi'(z) = \frac{1}{z}, \quad \mathcal{D} = \{z : |\arg z| < d\}.$

Suppose that the coefficients r and the derivatives of r are
analytic and bounded in $\mathcal{D}$. If on $\mathcal{D}$, $|f(z)| \le C|z|^{\alpha}/(1+|z|)^{2\alpha}$,
where C, $\alpha$ are positive then it is convenient to take $g(z) = z/(1+z)^2$,
in order that the conditions of Theorems 2.10 and 2.11 are satisfied.
However, if $|f(z)| \le C|z|^{\alpha}/(1+|z|)^{2+\alpha}$ in $\mathcal{D}$, where C and $\alpha$ are
positive, then it is possible to choose a simpler form for g, and
$S_k(x)$, namely

(2.62)  $g(x) = \frac{1}{\phi'(x)} = x; \quad S_k(x) = g(x)\,S(k,h)\circ\phi(x).$


In this latter case

(2.63)
$(rg)(x) = x\,r(x); \quad \left(\frac{rg}{\phi'}\right)(x) = x^2 r(x)$
$\left(\frac{(rg)'}{\phi'}\right)(x) = x^2 r'(x) + x\,r(x); \quad \left(\frac{(rg)''}{\phi'}\right)(x) = x^2 r''(x) + 2x\,r'(x)$
$\left(2(rg)' + rg\phi''/\phi'\right)(x) = 2x\,r'(x) + r(x)$
$(rg\phi')(x) = r(x)$

The approximations now take the form (2.48), in which $x_k = e^{kh}$.

If f is merely bounded on $\mathcal{D}$, and if $d = \lim_{x\to\infty} x^2 f'(x)$, then
we replace f by F in the differential equation, where

(2.64)  $f(x) = F(x) + \frac{a + bx}{1+x} + \frac{cx}{(1+x)^2}$

where

(2.65)  $a = f(0), \quad b = f(\infty), \quad c = b - a - d.$

If the limit $\lim_{x\to\infty} x^2 f'(x)$ does not exist, it may be better to take
$g(x) = x/(1+x)$ or $g(x) = x/(1+x)^2$, depending upon the problem.


Ex. 4: The case $\Gamma = [-\infty,\infty]$. In this case

(2.66)  $\phi(z) = z, \quad \phi'(z) = 1,$

and $\mathcal{D} = \mathcal{D}'_d$ (see Eq. (2.12)). If the coefficients r of the differential
equation are analytic and bounded in $\mathcal{D}'_d$, and if $f \in B(\mathcal{D}'_d)$, we simply
take

(2.67)  $g(x) = 1, \quad \{S_k(x)\}_{k=-N}^{N} = \{S(k,h)(x)\}_{k=-N}^{N}$

in order that the conditions of Theorem 2.10 become satisfied, provided
that f vanishes at $\pm\infty$. The conditions of Theorem 2.11 also become
satisfied if $|f(x)| \le Ce^{-\alpha|x|}$ on $\mathcal{D}$. Then $x_k = kh$, and

(2.68)
$(rg)(x) = r(x), \quad \left(\frac{rg}{\phi'}\right)(x) = r(x), \quad \left(\frac{(rg)'}{\phi'}\right)(x) = r'(x)$
$\left(\frac{(rg)''}{\phi'}\right)(x) = r''(x), \quad \left(2(rg)' + rg\phi''/\phi'\right)(x) = 2r'(x)$
$(rg\phi')(x) = r(x).$

The approximating equations again take the form (2.48).

If f does not vanish at $\pm\infty$, we replace f by F, where

(2.69)  $F(x) = f(x) - \frac{e^{-cx} f(-\infty) + e^{cx} f(\infty)}{e^{cx} + e^{-cx}}$

and where $0 < c < \pi/(2d)$.
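
The four examples differ only in the map $\phi$ and hence in the nodes $x_k = \psi(kh)$. A small sketch collecting the maps and their inverses is given below; the dictionary `MAPS`, its keys and the function `nodes` are names chosen for the illustration and do not appear in the paper.

```python
import numpy as np

# (phi, psi = inverse of phi) for the four intervals of Ex. 1 to Ex. 4
MAPS = {
    "[0,1]":      (lambda x: np.log(x / (1 - x)),       lambda w: 0.5 + 0.5 * np.tanh(w / 2)),
    "[-1,1]":     (lambda x: np.log((1 + x) / (1 - x)), lambda w: np.tanh(w / 2)),
    "[0,inf]":    (np.log,                              np.exp),
    "[-inf,inf]": (lambda x: x,                         lambda w: w),
}

def nodes(interval, N, h):
    """Sinc nodes x_k = psi(k h), k = -N, ..., N, for the named interval."""
    _, psi = MAPS[interval]
    return psi(h * np.arange(-N, N + 1))

if __name__ == "__main__":
    for name in MAPS:
        print(name, nodes(name, 2, 1.0))
```

Applying $\phi$ to the returned nodes recovers the equispaced points kh, which is the defining property (2.22).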


3.  Examples  of  Applications 


In this section we shall illustrate the application of the
formulas developed in Sec. 2, to the solution of some simple ordinary
and partial differential equations.

Ex. 1: Consider the simple problem

(3.1)  $f_{xx}(x) = -2, \quad 0 < x < 1; \quad f(0) = f(1) = 0.$

This has the solution f(x) = x(1-x). By taking r(x) = 1 in (2.54)
and combining with (2.48), we arrive at the system of equations

(3.2)  $h\left[-2A_m(x(1-x)) + \frac{1}{h}\,I^{(1)}_m A_m(1-2x) + \frac{1}{h^2}\,I^{(2)}_m\right] f_m = -2h\,A_m(x^2(1-x)^2)\,e$

where $e = (1,1,\ldots,1)^T$, T denoting the transpose. Solving this
system for the case N = 2, h = [illegible], $x_k = \frac{1}{2} + \frac{1}{2}\tanh(kh/2)$, we get the
approximations

$f_{-2} = .00575, \quad f_{-1} = .0856, \quad f_0 = .2495, \quad f_1 = .0856, \quad f_2 = .00575$

These are accurate to 3 significant figures.
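
The system for this model problem can be assembled in a few lines. The sketch below is an illustration under stated assumptions, not a reproduction of the computation reported above: it uses N = 32 and $h = \pi/\sqrt{2N}$ (the choice $h = [\pi d/(\alpha N)]^{1/2}$ with the assumed values d = π/2, α = 1), the function name is hypothetical, and the middle diagonal factor is taken as $2g' + g\phi''/\phi' = 1 - 2x$, which is what (2.54) gives for r = 1.

```python
import numpy as np

def solve_model_problem(N, h):
    """Solve the sinc system for f'' = -2, f(0) = f(1) = 0 and return (x_k, f_k)."""
    idx = np.arange(-N, N + 1)
    x = 0.5 + 0.5 * np.tanh(idx * h / 2)        # nodes x_k
    g = x * (1 - x)                             # g = 1/phi'
    diff = idx[None, :] - idx[:, None]          # diff[k, j] = j - k
    safe = np.where(diff == 0, 1, diff)
    sign = np.where(diff % 2 == 0, 1.0, -1.0)
    I1 = np.where(diff == 0, 0.0, sign / safe)                          # delta^{(1)}_{kj}
    I2 = np.where(diff == 0, -np.pi ** 2 / 3, -2.0 * sign / safe ** 2)  # delta^{(2)}_{kj}
    # Left side of the system divided by h: A(-2g) + (1/h) I1 A(1-2x) + (1/h^2) I2
    M = np.diag(-2 * g) + (I1 @ np.diag(1 - 2 * x)) / h + I2 / h ** 2
    rhs = -2 * g ** 2                           # right side divided by h
    return x, np.linalg.solve(M, rhs)

if __name__ == "__main__":
    N = 32
    x, f = solve_model_problem(N, np.pi / np.sqrt(2 * N))
    print(np.max(np.abs(f - x * (1 - x))))      # deviation from the exact solution x(1-x)
```

The coefficient matrix is full, as noted in the introduction, but its order is only 2N + 1.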


Ex. 2. $f'' = f - f^3/x^2$, $f(0) = f(\infty) = 0$. This problem was solved
by different procedures in [1] and [5]. By taking $x_k = e^{kh}$ and
combining (2.63) and (2.48), we get the approximating system

(3.3)  $\left[\frac{1}{h}\,I^{(1)}_m + \frac{1}{h^2}\,I^{(2)}_m\right] f_m = A_m(x^2)\,[f_m - j_m]$

where $j_m$ denotes the vector $[x_{-N}^{-2} f_{-N}^3, \ldots, x_N^{-2} f_N^3]^T$.

The solution of (3.3) involves the solution of a system of nonlinear
equations. By taking h = 1, N = 2 (m = 5) we get the approximations

$f_{-2} = .0855, \quad f_{-1} = 1.325, \quad f_0 = 5.834, \quad f_1 = 1.114, \quad f_2 = .1132$

which are accurate to 3 significant figures.


Ex. 3.

(3.4)  $u_{xx} = u_t, \quad 0 < x < 1,\ t > 0; \qquad u(x,0) = \sin \pi x, \quad u(0,t) = u(1,t) = 0.$

In order to get zero boundary conditions, we set

(3.5)  $u = v + \sin(\pi x)\,e^{-4t}.$

This yields the problem

(3.6)  $v_{xx} - v_t = (\pi^2 - 4)\sin(\pi x)\,e^{-4t}, \quad 0 < x < 1,\ t > 0; \qquad v(x,0) = 0, \quad v(0,t) = v(1,t) = 0.$

We solve this by taking our approximating basis functions to be

$S_k(x) = x(1-x)\,S(k,h)\circ\phi(x), \quad \phi(x) = \log[x/(1-x)]$
$S^*_\ell(t) = t\,S(\ell,h)\circ\psi(t), \quad \psi(t) = \log t.$

The  problem  (3.6)  may  now  readily  be  reduced  to  a matrix  problem,  by 
proceeding as for (3.2) above. Setting

Eq. (3.14) may be solved by diagonalizing BD and EC. If
$\alpha_{-N}, \ldots, \alpha_N$ and $\beta_{-N}, \ldots, \beta_N$ denote the eigenvalues of BD
and EC respectively, obtained by taking $X^{-1}BDX$ and $Z\,EC\,Z^{-1}$ via
e.g. the method of Golub and Kahan [3], and if $G = [g_{k\ell}] = X^{-1}FZ^{-1}$,
$Y = [y_{k\ell}]$, then $y_{k\ell} = g_{k\ell}/(\alpha_k + \beta_\ell)$, and $V = XYZ$.

By taking h = 1, N = 2, we get a solution which is accurate
to 3 dec. on $[0,1] \times [0,\infty]$.


Ex. 4.

(3.15)  $u_{xx} + u_{yy} = -1$ in $S = [0,1] \times [0,1]$; $\quad u = 0$ on $\partial S$.

Letting B and D be defined as in (3.9) and (3.11) we now get the
approximating matrix system

(3.16)  $BDU + UDB = -DHD$

where $U = [u_{ij}]$, $H = [h_{ij}]$, $h_{ij} = 1$. This may now be readily solved
via the diagonalization of BD. By taking N = 2, h = 1, we get a
solution which is accurate to 3 dec.
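
Matrix equations of this kind, with the unknown matrix multiplied from the left in one term and from the right in the other, can be solved by diagonalization as described in Ex. 3. The sketch below treats the generic equation A U + U B = C; the function name and the use of `numpy.linalg.eig` are choices made for the illustration (the paper cites a method of [3] for the eigenvalue computation), and the matrices A, B, C are placeholders rather than the B, D, H of the text, whose definitions are not reproduced here.

```python
import numpy as np

def solve_sylvester_by_diagonalization(A, B, C):
    """Solve A U + U B = C, assuming A and B are diagonalizable and that
    alpha_k + beta_l != 0 for all eigenvalues alpha_k of A and beta_l of B."""
    alpha, X = np.linalg.eig(A)          # A = X diag(alpha) X^{-1}
    beta, W = np.linalg.eig(B)           # B = W diag(beta) W^{-1}
    G = np.linalg.solve(X, C) @ W        # G = X^{-1} C W
    Y = G / (alpha[:, None] + beta[None, :])
    return X @ Y @ np.linalg.inv(W)      # U = X Y W^{-1}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P, Q = rng.standard_normal((5, 5)), rng.standard_normal((5, 5))
    A, B = P @ P.T + 5 * np.eye(5), Q @ Q.T + 5 * np.eye(5)
    C = rng.standard_normal((5, 5))
    U = solve_sylvester_by_diagonalization(A, B, C)
    print(np.max(np.abs(A @ U + U @ B - C)))
```

Once the two eigen-decompositions are available, each new right-hand side costs only a few matrix products.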


4.  Error  Analysis 


For sake of simplicity, we shall restrict ourselves to the simpler
case of the second order problem

(4.1)  $u'' + f(x,u) = 0, \quad u(0) = u(1) = 0.$

The analysis for the case of other ordinary or partial differential
equations is somewhat more complicated, but may be carried out similarly.
Throughout this section $\alpha, C_1, C_2, \ldots$ denote positive constants,
and $h = [\pi d/(\alpha N)]^{1/2}$.

In the notation of the previous sections, we take $\phi(z) = \log[z/(1-z)]$,
and we take the domain of analyticity to be $\mathcal{D} = \{z : |\arg[z/(1-z)]| < d\}$.
We shall assume that (4.1) has a (locally) unique solution $u_0$ which is
analytic and bounded in $\mathcal{D}$ and which satisfies the inequality

(4.2)  $|u_0(x)| \le C_1 x^{\alpha}(1-x)^{\alpha}, \quad 0 \le x \le 1.$

Definition 4.1. Let $M(d,\alpha)$ denote the family of all functions v
that are analytic in $\mathcal{D}$, such that

(4.3)
$v(0) = v(1) = 0;$
$gv'' \in B(\mathcal{D}), \quad |g(x)v''(x)| \le C_2 x^{\alpha-1}(1-x)^{\alpha-1}$ on (0,1);
$gf(\cdot,v) \in B(\mathcal{D}), \quad |g(x)f(x,v(x))| \le C_3 x^{\alpha-1}(1-x)^{\alpha-1}$ on (0,1);

where

(4.4)  $g(x) = x(1-x).$

We shall also assume that the solution of the Frechet derivative
problem

(4.5)  $\theta''(x) + f_u(x,u(x))\,\theta(x) = w(x), \quad \theta(0) = \theta(1) = 0$

satisfies

(4.6)  $|\theta(x)| \le C_4\,\|A^{-1}w\|$

for all $u \in M(d,\alpha)$ such that $\|u - u_0\| < \varepsilon$, where $\|\cdot\|$ is defined
by

(4.7)  $\|f\| = \sup_{x \in (0,1)} |f(x)|,$

where

(4.8)  $(A^{-1}f)(x) = \int_0^1 G(x,t)\,f(t)\,dt$

and where for any $x \in [0,1]$,

(4.9)  $G(x,t) = \begin{cases} (1-x)\,t & \text{if } 0 \le t \le x \\ x\,(1-t) & \text{if } x \le t \le 1. \end{cases}$

Moreover, we shall assume that if $\|u - u_0\| < \varepsilon$, then

(4.10)  $\|\{A^{-1}f(t,u(t))\}\| \le C_5.$

Let us assume that we have found an approximate solution

(4.11)  $u_m(x) = \sum_{k=-N}^{N} u_k\,S(k,h)\circ\phi(x) \quad (m = 2N + 1)$

by the method of the previous sections, and let us set

(4.12)  $\theta_m = u_m - u_0.$

Then

(4.13)  $\theta''_m(x) + f_u(x,u(x))\,\theta_m(x) = u''_m(x) + f(x,u_m(x))$

for some u between $u_0$ and $u_m$, and therefore, by (4.5) and (4.6),

(4.14)  $|\theta_m(x)| \le C_4\,\|u_m + A^{-1}f(\cdot,u_m)\|.$

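Now by Theorem 2.11, we find, by taking $S_k(x) = g(x)\,S(k,h)\circ\phi(x)$,
$x_k = \frac{1}{2} + \frac{1}{2}\tanh(kh/2)$, that

(4.15)  $\int_0^1 [v''(x) + f(x,v(x))]\,S_k(x)\,dx \approx h\,\frac{g(x_k)}{\phi'(x_k)}\,[v''(x_k) + f(x_k,v(x_k))]$

and

(4.16)  $\int_0^1 v''(x)\,S_k(x)\,dx \approx h \sum_{j=-N}^{N} v(x_j) \left\{ \frac{g''(x_j)}{\phi'(x_j)}\,\delta^{(0)}_{kj} + \left\{2g'(x_j) + \frac{g(x_j)\phi''(x_j)}{\phi'(x_j)}\right\} \frac{\delta^{(1)}_{kj}}{h} + g(x_j)\,\phi'(x_j)\,\frac{\delta^{(2)}_{kj}}{h^2} \right\}$

in which the error of either term on the right-hand side of (4.15) is
bounded by $C_6 N^{-1/2} e^{-(\pi d\alpha N)^{1/2}}$ and the error of the right-hand side of
(4.16) is bounded by $C_7 N^{1/2} e^{-(\pi d\alpha N)^{1/2}}$. By our process of solution, the
numbers $u_k$ in (4.11) are determined such that

(4.17)  $h \sum_{j=-N}^{N} u_j \left\{ \frac{g''_j}{\phi'_j}\,\delta^{(0)}_{kj} + \left(2g'_j + \frac{g_j\phi''_j}{\phi'_j}\right) \frac{\delta^{(1)}_{kj}}{h} + g_j\,\phi'_j\,\frac{\delta^{(2)}_{kj}}{h^2} \right\} + h\,\frac{g_k}{\phi'_k}\,f(x_k,u_k) = 0, \quad k = -N, -N+1, \ldots, N.$

Theorem 4.2: Let the numbers $u_k$ (k = -N, -N+1, ..., N) be determined
by (4.17), and let $u_m(x)$ be defined as in (4.11). Then

(4.18)  $|u_m(x) - u_0(x)| \le C_{16} N^{3/2} e^{-(\pi d\alpha N)^{1/2}}, \quad 0 \le x \le 1,$

where $u_0$ is the solution of (4.1).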


Proof: In view of the errors in the approximations (4.15) and (4.16),
the solution of (4.17) is equivalent to finding a function $v \in M(d,\alpha)$,
such that

(4.19)  $\frac{g(x_k)}{\phi'(x_k)}\,[v''(x_k) + f(x_k,v(x_k))] = \varepsilon_k, \quad k = -N, -N+1, \ldots, N,$

where $v(x_k) = u_k$, and where

(4.20)  $|\varepsilon_k| \le \ldots, \quad k = -N, -N+1, \ldots, N.$


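Since $v \in M(d,\alpha)$, it follows, for any $t \in (0,1)$, that

(4.21)  $\frac{g(t)}{\phi'(t)}\,[v''(t) + f(t,v(t))] - \sum_{k=-\infty}^{\infty} \frac{g(x_k)}{\phi'(x_k)}\,[v''(x_k) + f(x_k,v(x_k))]\,S(k,h)\circ\phi(t) = \frac{\sin[\pi\phi(t)/h]}{2\pi i} \int_{\partial\mathcal{D}} \frac{g(z)[v''(z) + f(z,v(z))]\,dz}{[\phi(z) - \phi(t)]\sin[\pi\phi(z)/h]}.$

By multiplying (4.21) by $\phi'(t)^2$, taking $A^{-1}$ of each side, and noting
that $g(t)\phi'(t) = 1$, we get

(4.22)  $v(x) + \{A^{-1}f(t,v(t))\}(x) - \sum_{k=-\infty}^{\infty} \frac{g_k}{\phi'_k}\,[v''_k + f(x_k,v_k)]\,A^{-1}\{\phi'(t)^2\,S(k,h)\circ\phi(t)\}(x)$
$= A^{-1}\left\{ \frac{\phi'(t)^2 \sin[\pi\phi(t)/h]}{2\pi i} \int_{\partial\mathcal{D}} \frac{g(z)[v''(z) + f(z,v(z))]\,dz}{(\phi(z) - \phi(t))\sin[\pi\phi(z)/h]} \right\}(x).$

Since $\phi'(t) = 1/[t(1-t)]$, it follows, by taking
$t = [1 + \tanh(u/2)]/2$, $x = [1 + \tanh(w/2)]/2$, and using (4.8) and
(4.9), that

(4.23)  $I_1(h,x) \equiv A^{-1}\{\phi'(t)^2\,\mathrm{sinc}[\{\phi(t) - kh\}/h]\}(x)$
$= \int_{-\infty}^{w} \frac{1 - \tanh(w/2)}{1 - \tanh(u/2)}\,\mathrm{sinc}\left[\frac{u - kh}{h}\right] du + \int_{w}^{\infty} \frac{1 + \tanh(w/2)}{1 + \tanh(u/2)}\,\mathrm{sinc}\left[\frac{u - kh}{h}\right] du.$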

w 


On the interval $[-\infty,w]$, the function $[1-\tanh(w/2)]/[1-\tanh(u/2)]$
increases monotonically from $[1-\tanh(w/2)]/2$ to 1 while on $[w,\infty]$,
the function $[1+\tanh(w/2)]/[1+\tanh(u/2)]$ decreases monotonically
from 1 to $[1+\tanh(w/2)]/2$. For this reason, it may be shown by
a somewhat lengthy, but simple argument, that

(4.24)  $|I_1(h,x)| \le 4\pi h.$

Similarly if $x \in [0,1]$ and $z \in \partial\mathcal{D}$, we can show that

(4.25)  $|I_2(h,x)| \equiv \left| A^{-1}\left\{ \frac{\phi'(t)^2 \sin[\pi\phi(t)/h]}{2\pi i\,[\phi(z) - \phi(t)]} \right\}(x) \right| \le \frac{2h}{d},$

since $\mathrm{Im}\,\phi(z) = \pm d$.


(4.26) 


By means of (4.19), (4.22) and (4.25), Eq. (4.21) now yields

(4.26)  $|v(x) + \{A^{-1}f(t,v(t))\}(x)| \le |I_1(h,x)| \left\{ \sum_{k=-N}^{N} |\varepsilon_k| + \sum_{|k|>N} \frac{g_k}{\phi'_k}\,|v''_k + f(x_k,v_k)| \right\} + |I_2(h,x)| \int_{\partial\mathcal{D}} \left| \frac{g(z)[v''(z) + f(z,v(z))]}{\sin[\pi\phi(z)/h]}\,dz \right|.$

Using the bounds given in (4.20) and (4.24), we bound the first sum
on the right-hand side of (4.26) by $C_9 N^{3/2} e^{-(\pi d\alpha N)^{1/2}}$; using (4.3) and
(4.24), and recalling that $x_k = \frac{1}{2} + \frac{1}{2}\tanh(kh/2)$, we bound the second
sum on the right-hand side of (4.26) by $C_{10}\,e^{-(\pi d\alpha N)^{1/2}}$; and using
(4.25) and the fact that $|\sin[\pi\phi(z)/h]| \ge \sinh[\pi d/h]$ if $z \in \partial\mathcal{D}$,
we bound the integral term on the right-hand side of (4.26) by
$2h\,N(g[v'' + f(\cdot,v)],\mathcal{D})/[d\,\sinh(\pi d/h)] \le C_{11} N^{-1/2} e^{-(\pi d\alpha N)^{1/2}}$. Hence for all
$x \in [0,1]$,

(4.27)  $|v(x) + \{A^{-1}f(t,v(t))\}(x)| \le C_{12} N^{3/2} e^{-(\pi d\alpha N)^{1/2}}.$

Since $v \in M(d,\alpha)$, it follows from the first and second of (4.3)
that

(4.28)  $|v(x)| \le C_{13}\,x^{\alpha}(1-x)^{\alpha}, \quad 0 \le x \le 1.$

Furthermore, since $v \in M(d,\alpha)$, and since $u_m$ and v coincide at
$x_{-N}, x_{-N+1}, \ldots, x_N$, it follows that [12, Theorem 8.2] for all $x \in [0,1]$,

(4.29)  $|u_m(x) - v(x)| \le C_{14} N^{1/2} e^{-(\pi d\alpha N)^{1/2}}.$

In view of (4.5), (4.6) and (4.10), it now follows that for all $x \in [0,1]$,

(4.30)  $|u_m(x) + (A^{-1}f(t,u_m(t)))(x)| \le |v(x) + \{A^{-1}f(t,v(t))\}(x)| + C_{15}\,|u_m(x) - v(x)|.$

By (4.14), (4.27), (4.29) and (4.30), it thus follows that for all
$x \in [0,1]$

(4.31)  $|\theta_m(x)| = |u_m(x) - u_0(x)| \le C_{16} N^{3/2} e^{-(\pi d\alpha N)^{1/2}}.$

This completes the proof of Theorem 4.2.

Similarly, it may be shown that when using $n = (2N+1)^2$ points
to obtain an approximate solution of a partial differential equation,
such as (3.15), the error is bounded by $C'_{16} N^{3/2} e^{-\gamma N^{1/2}}$, which is
$O(n^{3/4} e^{-c\,n^{1/4}})$ for some c > 0.

Indeed,  for  the  case  of  (3.15),  we  may  take  CJ6  = 1 and  y = ti2  . 
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ON  BOUNDARY  EXTRAPOLATION  AND  DISSIPATIVE  SCHEMES  FOR  HYPERBOLIC  PROBLEMS 

Moshe Goldberg*

Department of Mathematics
University of California
Los Angeles, California, 90024

ABSTRACT. In this note we consider dissipative, stable approximations
to well-posed linear hyperbolic initial value problems in the quarter plane
x > 0, t > 0. We show that if boundary values are determined by
extrapolation, then stability is maintained. This result was first suggested
by Kreiss, and proved explicitly by the author. The proof is reviewed
here using a stability criterion for a certain family of boundary conditions
due to Goldberg and Tadmor.

The Lax-Wendroff scheme and other dissipative approximations are
applied to a test problem. As expected from Gustafsson's rate-of-convergence
theory, computations verify that if the boundary extrapolation
and the difference scheme have equal order of accuracy, then
this order is preserved.

1.  INTRODUCTION. Consider the conservation law

(1a)  $\partial u/\partial t + \partial g(u)/\partial x = 0, \quad x > 0,\ t > 0,$

and assume that the associated initial value problem

(1b)  $u(x,0) = f(x)$

is well-posed in $L^2(0,\infty)$, so no boundary values are required at x = 0.
This assumption implies that characteristic lines do not carry information
from the exterior of the domain x > 0, t > 0 inward.

To approximate the initial value problem (1), we introduce a mesh
size $\Delta x > 0$, $\Delta t > 0$; a grid function $v_\nu(t) = u(\nu\Delta x, t)$, $\nu = 0, \pm 1, \pm 2, \ldots$;
and a consistent, explicit finite difference scheme

(2)  $v_\nu(t + \Delta t) = S(v_{\nu-r}(t), \ldots, v_{\nu+p}(t)); \quad \nu = 1,2,3,\ldots,$

r, p being fixed integers.


* This research was sponsored in part by the Air Force Office of
Scientific Research, Air Force Systems Command, USAF, under Grant No.
AFOSR-76-3046.
Since nonlinearity in (1a) leads to nonlinear dependence of
$v_\nu(t + \Delta t)$ on the components of $v_{\nu-r}(t), \ldots, v_{\nu+p}(t)$, we are
unable to be more specific, at this stage, about the structure of the scheme S.
However, we assume that S is $L^2$-stable, in case it is applied to the
pure initial value problem for $-\infty < x < \infty$.

Usually, r > 0, so it is impossible to approximate (1) by (2)
without specifying boundary values at r grid points in some left
neighborhood of the boundary x = 0.  Thus, we admit boundary conditions
of the form

(3)  $v_\mu(t) = \sum_{j=1}^{s} c_j v_{\mu+j}(t)$,  $\mu = 0, \ldots, -r+1$,

where the coefficients $c_j$ and $s \ge 1$ are fixed.  That is, having the
values $v_\nu(t)$, $\nu \ge 1$, computed by the basic scheme (2), we proceed, at
each time step, by using (3) to determine $v_\mu(t)$, $\mu = 0, -1, \ldots, -r+1$,
in that order.

A natural way to choose the boundary conditions in (3) would be to
employ extrapolation of degree s - 1, a procedure which is of
accuracy of order s.  More explicitly, we extrapolate from $v_1(t), \ldots, v_s(t)$
to $v_0(t)$; then from $v_0(t), \ldots, v_{s-1}(t)$ to $v_{-1}(t)$, etc.  With the use
of Stirling's extrapolation formula, (3) becomes

(4)  $v_\mu(t) = \sum_{j=1}^{s} \binom{s}{j} (-1)^{j+1} v_{\mu+j}(t)$,  $\mu = 0, \ldots, -r+1$.
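The extrapolation rule (4) can be sketched in a few lines of Python. This is only an illustration: the function names `extrapolation_coefficients` and `fill_boundary_values` are hypothetical and do not come from the paper. The sketch builds the coefficients $\binom{s}{j}(-1)^{j+1}$ and fills the r boundary values in the order $\mu = 0, -1, \ldots, -r+1$, as described above.

```python
from math import comb


def extrapolation_coefficients(s):
    """Coefficients c_j = C(s, j) * (-1)**(j + 1), j = 1..s, as in (4)."""
    return [comb(s, j) * (-1) ** (j + 1) for j in range(1, s + 1)]


def fill_boundary_values(interior, s, r):
    """Return [v_{-r+1}, ..., v_0] given interior = [v_1, v_2, ...].

    Values are generated in the order mu = 0, -1, ..., -r+1, so each new
    value may use the boundary values computed just before it.
    Requires len(interior) >= s.
    """
    c = extrapolation_coefficients(s)
    values = list(interior)  # values[0] always holds v_{mu+1}
    for _ in range(r):
        v_mu = sum(c[j - 1] * values[j - 1] for j in range(1, s + 1))
        values.insert(0, v_mu)
    return values[:r]


if __name__ == "__main__":
    print(extrapolation_coefficients(2))  # linear extrapolation
    # Samples of x**2 at x = 1, 2, 3, 4; s = 3 reproduces quadratics exactly.
    print(fill_boundary_values([1.0, 4.0, 9.0, 16.0], s=3, r=2))
```

For s = 1 the rule reduces to $v_0 = v_1$, and for s = 2 to $v_0 = 2v_1 - v_2$.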


The main purpose of this note is to study the influence of boundary
extrapolation on the stability of the numerical algorithm.  This question
is discussed in Section 2, where we consider a scalar linear conservation
law which we approximate by a dissipative scheme.  In this simple case it
is shown that boundary extrapolation maintains stability.  This result,
which was first suggested by Kreiss, [5], was proven explicitly in [1],
using Kreiss' theory, [6], for dissipative approximations of mixed
initial-boundary value problems.

In fact, the above assertion is an immediate corollary of a forthcoming
work by Goldberg and Tadmor, [2], which provides stability criteria
for some general families of boundary conditions, including those
presented in (3).




Finally, in Section 3, the Lax-Wendroff scheme, [7], and a new
5-point dissipative approximation by Gottlieb and Turkel, [3], are applied
to a test problem.  The numerical results support Gustafsson's rate-of-
convergence theory, [4], by showing that the accuracy order of the basic
scheme is maintained if the extrapolation at the boundary is of the
same order.  The computations were carried out at the Campus Computing
Network of the University of California at Los Angeles.


2.  STABILITY ANALYSIS.  From now on we restrict attention to the
linear, scalar version of (1), namely to the initial value problem

(5)  $\partial u/\partial t + a\,\partial u/\partial x = 0$;  a = const.;  x > 0,  t > 0;  $u(x,0) = f(x)$,

which is well-posed if and only if a < 0.

Our explicit approximation in (2) becomes

(6)  $v_\nu(t + \Delta t) = Q v_\nu(t)$,  $\nu = 1, 2, 3, \ldots$,

     $Q = \sum_{j=-r}^{p} a_j E^j$,  $E v_\nu = v_{\nu+1}$,  r > 0,

where the constants $a_j$ depend on a and on the fixed ratio $\lambda = \Delta t/\Delta x$,
and initial values are determined by

     $v_\nu(0) = f(\nu\Delta x)$,  $\nu = 1, 2, 3, \ldots$ .

The assumption of dissipativity is that for some $\delta > 0$ and natural
$\omega$, the amplification factor of the scheme,

     $\hat{Q}(\xi) = \sum_{j=-r}^{p} a_j e^{ij\xi}$,  $-\pi \le \xi \le \pi$,

satisfies

     $|\hat{Q}(\xi)| \le 1 - \delta|\xi|^{2\omega}$,  $\forall\, |\xi| \le \pi$.

Thus, it is evident that $\hat{Q}(\xi)$ is now power bounded (by 1), which is
well known to be equivalent to the (strong) stability of our basic scheme.
Introducing the boundary conditions (4), the concept of stability
becomes considerably more complicated, and we review it briefly.  Let
$H \equiv H(\Delta x)$ be the space of all grid functions, $w = \{w_\nu\}_{\nu=-r+1}^{\infty}$, which
satisfy $\sum_\nu |w_\nu|^2 < \infty$ and fulfill the boundary conditions in (4).  If
inner product and norm are defined by

     $(v,w) = \Delta x \sum_{\nu=-r+1}^{\infty} v_\nu \bar{w}_\nu$,  $\|w\|^2 = (w,w)$,

then H becomes a discrete analogue of $L^2(0,\infty)$.

Having constructed H, we realize that our finite difference
algorithm in (4) and (6) defines a linear, bounded operator, $G : H \to H$,
such that the numerical solution v satisfies

     $v(t + \Delta t) = G v(t)$,  for $v(t) \in H$.

Since

     $v(t) = G^m v(0)$  for $t = m\Delta t$,  $m = 1, 2, 3, \ldots$,

stability means that the powers of G are uniformly bounded, i.e., that
for some constant K,

     $\|G^m\| \le K$,  $m = 1, 2, 3, \ldots$ .

We  are  now  ready  to  state  the  main  result: 

THEOREM 1.  Let the initial value problem (5) be approximated by an
arbitrary, dissipative (stable) scheme of type (6), which is complemented
by boundary extrapolation of arbitrary order.  Then, the overall numerical
algorithm is stable.

The proof, which is laid out in [1], is a direct but somewhat lengthy
application of Kreiss' stability theory, [6], for dissipative schemes.
As required by Kreiss' criterion, the problem was to show that the
corresponding operator G has no eigenvalues z with $|z| \ge 1$.

In a forthcoming paper, [2], Goldberg and Tadmor use Kreiss' theory
to provide a particularly simple stability condition in the case where
scheme (6) is augmented by boundary conditions of type (3).  This condition
is rephrased as follows:

THEOREM 2 (Goldberg, Tadmor).  Let (6) be an arbitrary, dissipative
(stable) approximation, augmented by boundary conditions of type (3).
Then the overall algorithm is stable if

     $\sum_{j=1}^{s} c_j \kappa^j \ne 1$,  $\forall\, \kappa$ with $|\kappa| < 1$.

Theorem 2, which is actually independent of the basic scheme, yields
Theorem 1 immediately.  For, considering the boundary conditions in (4),
we want to show that

     $\sum_{j=1}^{s} \binom{s}{j} (-1)^{j+1} \kappa^j \ne 1$  for $\kappa$ with $|\kappa| < 1$,

i.e., that for all $\kappa$ with $|\kappa| < 1$,

     $(1 - \kappa)^s \equiv \sum_{j=0}^{s} \binom{s}{j} (-\kappa)^j \ne 0$.

The last inequality holds, and Theorem 1 follows.
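The step from the condition of Theorem 2 to the binomial form can be spelled out as follows. This is a short worked derivation, using only the coefficients $c_j = \binom{s}{j}(-1)^{j+1}$ of (4) and the binomial theorem:

```latex
\begin{align*}
1 - \sum_{j=1}^{s} \binom{s}{j} (-1)^{j+1} \kappa^{j}
  &= 1 + \sum_{j=1}^{s} \binom{s}{j} (-\kappa)^{j}
   = \sum_{j=0}^{s} \binom{s}{j} (-\kappa)^{j}
   = (1 - \kappa)^{s} .
\end{align*}
% Example, s = 2 (linear extrapolation, c_1 = 2, c_2 = -1):
%   2\kappa - \kappa^2 = 1  <=>  (1 - \kappa)^2 = 0  <=>  \kappa = 1,
% which lies outside the open disk |\kappa| < 1.
```

Hence the sum equals 1 only at $\kappa = 1$, which is excluded by $|\kappa| < 1$.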

3.  NUMERICAL RESULTS.  Consider the test problem

(7)  $\partial u/\partial t - \partial u/\partial x = 0$;  x > 0,  t > 0;  $u(x,0) = \sin 2\pi x$,

whose analytic solution is

     $u(x,t) = \sin 2\pi(x + t)$.

The second order accurate Lax-Wendroff scheme (L-W), [7], is in this
case

(8)  $v_\nu(t + \Delta t) = \tfrac{1}{2}\lambda(\lambda - 1) v_{\nu-1}(t) + (1 - \lambda^2) v_\nu(t) + \tfrac{1}{2}\lambda(\lambda + 1) v_{\nu+1}(t)$,  $\lambda = \Delta t/\Delta x$,

and it is well known (e.g., [8, Chapter 12]) that dissipativity, and
hence stability, are guaranteed if $\lambda < 1$.
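The dissipativity claim for (8) can be checked numerically. The following Python sketch evaluates the amplification factor $\hat{Q}(\xi) = \sum_j a_j e^{ij\xi}$ on a grid of $\xi$ values. The function names are hypothetical, and the constant `delta` is an illustrative choice obtained here from the closed form $|\hat{Q}(\xi)|^2 = 1 - 4\lambda^2(1-\lambda^2)\sin^4(\xi/2)$; it is not a value quoted in the paper.

```python
import cmath
import math


def lax_wendroff_coeffs(lam):
    """Coefficients a_j of scheme (8), keyed by the shift j."""
    return {-1: 0.5 * lam * (lam - 1.0),
            0: 1.0 - lam * lam,
            1: 0.5 * lam * (lam + 1.0)}


def amplification_factor(coeffs, xi):
    """Q_hat(xi) = sum_j a_j * exp(i * j * xi)."""
    return sum(a * cmath.exp(1j * j * xi) for j, a in coeffs.items())


def sample_points(samples=721):
    """Equally spaced xi values covering [-pi, pi]."""
    return [-math.pi + 2.0 * math.pi * k / (samples - 1) for k in range(samples)]


def max_modulus(coeffs):
    return max(abs(amplification_factor(coeffs, xi)) for xi in sample_points())


def is_dissipative(coeffs, delta, omega, tol=1e-12):
    """Check |Q_hat(xi)| <= 1 - delta * |xi|**(2 * omega) on the sample grid."""
    return all(abs(amplification_factor(coeffs, xi))
               <= 1.0 - delta * abs(xi) ** (2 * omega) + tol
               for xi in sample_points())


if __name__ == "__main__":
    lam = 0.5
    # Illustrative delta (an assumption): 2 * lam^2 * (1 - lam^2) / pi^4.
    delta = 2.0 * lam ** 2 * (1.0 - lam ** 2) / math.pi ** 4
    print(max_modulus(lax_wendroff_coeffs(lam)))
    print(is_dissipative(lax_wendroff_coeffs(lam), delta, omega=2))
```

For $\lambda > 1$ the same check shows $|\hat{Q}(\xi)| > 1$ for some $\xi$, in line with the stability limit stated above.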

In order to apply (8) to (7) we need to specify only one boundary
value, $v_0(t)$, which according to (4), is given by

     $v_0(t) = \sum_{j=1}^{s} \binom{s}{j} (-1)^{j+1} v_j(t)$.

Here the accuracy of the boundary extrapolation is of order s, where
s is arbitrary.
In Table 1 we compare the L-W results with the analytic solution
at t = 1.  The H-norm of the error was computed over the interval
$0 \le x \le 1$, and is defined by

(9)  $\|e\|^2_{(0,1)} = \Delta x \sum_{\nu=0}^{J} [v_\nu(1) - u(\nu\Delta x, 1)]^2$,  $J = 1/\Delta x$.


     Δx       s      m      ||e||_(0,1)

     .05      2      40     5.57 - 2
     .025     2      80     (illegible)
     .05      1      40     7.03 - 2
     .025     1      80     2.23 - 2

Table 1.  L-W results at t = 1;  λ = 1/2;  m = t/Δt is number of time steps.
(An entry a - b denotes a × 10^(-b).)

Gustafsson, in his rate-of-convergence theory, [4], has discussed
situations similar to the one under consideration.  He has shown that in
order to maintain the accuracy of the basic scheme, it is sufficient to
employ boundary conditions of the same order of accuracy.  Indeed, Table 1
suggests that L-W's second order accuracy is maintained if the boundary
extrapolation is linear (s = 2), but is reduced if s = 1.
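The experiment behind Table 1 can be sketched in Python as follows. This is a minimal reconstruction under stated assumptions, not the author's program: the quarter plane is truncated at a hypothetical right edge `x_max = 3`, where the analytic solution is imposed as inflow data, and the function name `lax_wendroff_error` is invented for illustration. The scheme is (8), the boundary value comes from (4), and the error norm is the square root of (9).

```python
import math
from math import comb


def exact(x, t):
    """Analytic solution of test problem (7)."""
    return math.sin(2.0 * math.pi * (x + t))


def lax_wendroff_error(dx, s, lam=0.5, t_end=1.0, x_max=3.0):
    """Run scheme (8) with boundary extrapolation (4); return ||e||_(0,1)."""
    n = round(x_max / dx)          # index of the (assumed) right edge
    dt = lam * dx
    steps = round(t_end / dt)
    coeffs = [comb(s, j) * (-1) ** (j + 1) for j in range(1, s + 1)]
    v = [exact(i * dx, 0.0) for i in range(n + 1)]
    for m in range(1, steps + 1):
        new = v[:]
        for i in range(1, n):      # basic scheme (8) at interior points
            new[i] = (0.5 * lam * (lam - 1.0) * v[i - 1]
                      + (1.0 - lam * lam) * v[i]
                      + 0.5 * lam * (lam + 1.0) * v[i + 1])
        new[n] = exact(n * dx, m * dt)   # assumption: exact inflow data
        new[0] = sum(c * new[j] for j, c in enumerate(coeffs, start=1))
        v = new
    big_j = round(1.0 / dx)
    total = dx * sum((v[i] - exact(i * dx, steps * dt)) ** 2
                     for i in range(big_j + 1))
    return math.sqrt(total)


if __name__ == "__main__":
    for s in (2, 1):
        for dx in (0.05, 0.025):
            print(s, dx, lax_wendroff_error(dx, s))
```

Halving Δx with s = 2 should reduce the error by a factor of roughly four, which is the second order behaviour discussed above. The exact figures depend on the right-edge treatment assumed here and need not match Table 1 digit for digit.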

A second example is concerned with a family of centered, 5-point,
dissipative schemes by Gottlieb and Turkel (G-T), [3].  The family, given
in (2.4) of [3], depends on two parameters α and β.  Choosing α = 1/2,
β = 1, and linearizing, we obtain an approximation to (7) of the form

vv(t  + At)  = - i)vv_2(t)  + X(X  - |)vv_1(t)  + (1  - ^X2)vv(t) 

+ 4(X  + §)vv+1(t)  - £0  + l^i),  4 = g§, 

where  the  dissipativity  condition  is  X < '^2/2. 

Now we need two boundary values which are given by

     $v_\mu(t) = \sum_{j=1}^{s} \binom{s}{j} (-1)^{j+1} v_{\mu+j}(t)$,  $\mu = 0, -1$.

The error-norms in Table 2 are computed, as in the previous case,
over $0 \le x \le 1$, and in analogy to (9) are defined by

     $\|e\|^2_{(0,1)} = \Delta x \sum_{\nu=0}^{J} [v_\nu(1) - u(\nu\Delta x, 1)]^2$,  $J = 1/\Delta x$.


     Δx       λ       s      m       ||e||_(0,1)

     .05      .5      4      40      1.85 - 2
     .025     .25     4      160     1.16 - 3
     .05      .5      3      40      2.42 - 2
     .025     .25     3      160     1.92 - 3

Table 2.  G-T results for t = 1.

Unlike  the  L-W  scheme  which  is  of  second  order  accuracy  both  in 
time  and  space,  the  G-T  approximation  is  of  second  order  in  time  and 
fourth  order  in  space.  Since  the  boundary  extrapolation  is  taken  only 
with  respect  to  the  space  variable,  it  should  be  expected  that  in  order 
to  maintain  the  fourth-order  accuracy  in  x,  we  have  to  utilize  cubic 
extrapolation  (s  = 4),  regardless  of  the  fact  that  G-T's  accuracy  in 
time  is  only  of  second  order.  This  is  reflected  by  the  results  of 
Table  2. 
ELLPACK:  A COOPERATIVE  EFFORT  FOR  THE  STUDY  OF  NUMERICAL  METHODS
FOR  ELLIPTIC  PARTIAL  DIFFERENTIAL  EQUATIONS

John  R.  Rice 
Purdue  University 

1.  BACKGROUND.  In the summer of 1975, Garrett Birkhoff started discussing the
possibility of a cooperative effort to develop and evaluate software for elliptic
partial differential equations.  In the summer of 1976, James Ortega organized
a small meeting of interested parties to explore various viewpoints and
possibilities for this idea.  A framework was outlined which seemed to accommodate
a number of people's work and interests.  Two initial projects were agreed upon
and John R. Rice was selected as coordinator of the ELLPACK effort.

Participation  in  ELLPACK  is  completely  voluntary.  Purdue  University  will 
provide  the  software  framework  and  define  the  structure  precisely.  Contributors 
can  prepare  programs  which  fit  into  this  framework  and  Purdue  will  incorporate 
them  into  ELLPACK.  It  is  assumed  that  contributors  will  submit  high  quality, 
portable  programs  to  ease  the  burden  of  integrating  programs  into  ELLPACK. 

2.  ELLPACK  OBJECTIVES.  The  primary  objective  of  ELLPACK  is  as  a tool  for 
research  in  the  evaluation  and  development  of  numerical  methods  for  solving 
elliptic  partial  differential  equations.  Various  software  components  can  be 
interchanged  and  the  resulting  performance  (accuracies,  efficiencies,  etc.)  can 
be  measured. 


ELLPACK's use as a research tool requires its framework to be convenient,
flexible and modular.  Thus it will be suitable for educational use and for
others who have easy or moderately difficult problems.  It is not intended that
ELLPACK be directly applicable to the very complex applications (e.g. temperature
distribution in a nuclear power plant or in a reentry vehicle).  Nevertheless,
the ability to quickly solve a variety of moderately complex problems should be
valuable in many areas.

3.  THE  TWO  PROJECTS.  The  first  project  is  ELLPACK  77  which  is  based  on 
adaptations  of  existing  programs.  One  of  its  primary  objectives  is  to  test  the 
concept  of  a modular  approach  using  interchangeable  software  parts.  ELLPACK  77 
is  restricted  to  rectangular  geometry  in  2 or  3 dimensions.  Anticipated 
capabilities include:

Operator Approximation

     2-Dimensions:     5-point star, 9-point star, Collocation and Galerkin
                       with Hermite cubics
     3-Dimensions:     7-point star, 27-point star
     Special Options:  Poisson Problem, Constant Coefficients, Self Adjoint
                       Form.

Equation Solution

     Direct Elimination (Band or Profile)
     Nested Dissection
     "Fast" Methods (Tensor Product and FFT)
     SOR, Accelerated Jacobi, etc.
     Conjugate Gradient
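To make the idea of pairing an operator approximation with an equation solver concrete, here is a minimal Python sketch that combines the 5-point star with SOR on the unit square. It is not ELLPACK code: the function name `solve_laplace_sor`, the restriction to Laplace's equation with Dirichlet data, and the relaxation factor 1.5 are all assumptions made for illustration.

```python
def solve_laplace_sor(n, boundary, omega=1.5, tol=1e-12, max_sweeps=10000):
    """Solve u_xx + u_yy = 0 on the unit square with the 5-point star and SOR.

    n        -- number of mesh intervals in each direction (h = 1/n)
    boundary -- function (x, y) -> Dirichlet value on the boundary
    Returns the grid of values u[i][j] ~ u(i*h, j*h).
    """
    h = 1.0 / n
    u = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n + 1):
            if i in (0, n) or j in (0, n):
                u[i][j] = boundary(i * h, j * h)
    for _ in range(max_sweeps):
        biggest_change = 0.0
        for i in range(1, n):
            for j in range(1, n):
                # 5-point star: each unknown is the average of its 4 neighbours.
                average = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])
                change = omega * (average - u[i][j])   # over-relaxed update
                u[i][j] += change
                biggest_change = max(biggest_change, abs(change))
        if biggest_change < tol:
            break
    return u


if __name__ == "__main__":
    harmonic = lambda x, y: x * x - y * y   # satisfies Laplace's equation
    grid = solve_laplace_sor(8, harmonic)
    print(grid[4][4], harmonic(0.5, 0.5))
```

Because the 5-point star is exact for quadratic polynomials, the harmonic test function above is reproduced up to the iteration tolerance. In a modular framework, the discretization loop and the SOR sweep would be separate, interchangeable parts.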

The second project is ELLPACK 78 where the primary extension is to
non-rectangular geometry, an area where some group members are already active.
Other directions which may be followed include:

a.  Standard, automatic changes of variables

b.  Enhancement of rectangular domain capabilities

c.  More operator approximations, e.g.
    HODIE methods, Hermite cubics in 3 dimensions, Method of particular
    solutions, Capacitance methods

d.  More equation solvers, e.g.
    Cyclic Chebyshev, Automated selection of SOR parameters

e.  Parallel processor implementation

It  is  hoped  that  a significant  part  of  these  capabilities  will  be  implemented 
by  late  1978.  At  that  point  the  ELLPACK  effort  will  be  evaluated  and  future 
efforts,  if  any,  considered. 

4.  TECHNICAL OPERATION.  The framework and ELLPACK specifications are defined
at  Purdue  University.  Careful  attention  is  given  to  making  ELLPACK  compatible 
with  a wide  range  of  interests  and  to  making  it  "easy"  to  contribute  to  ELLPACK. 

On  the  other  hand,  success  depends  on  certain  uniform  standards  and  conventions 
and  there  is  no  doubt  that  choices  will  be  made  which  some  contributors  find 
inconvenient.  It  is  assumed  that  contributors  are  experienced  in  producing 
portable,  understandable  and  good  quality  software.  The  main  technical  document 
is  the  Contributor's  Guide  to  ELLPACK  [Rice,  1976]  which  defines  the  ELLPACK 
environment  for  a potential  contributor.  There  is  also  a shorter  User's  Guide 
[Rice,  1977]  and  a guide  to  implementing  ELLPACK  at  locations  other  than  Purdue. 

The  following  assumptions  are  made  about  the  environment  for  ELLPACK: 

A.  A standard  Fortran  compiler  is  available. 

B.  The  operating  system  allows  Fortran  preprocessors. 

C.  The  operating  system  allows  a reasonable  number  of  files  (probably  on  disks) 
to  be  created  by  a job  and  simply  manipulated.  Libraries  of  compiled  programs 
are  possible. 

D.  Certain common utility routines are available.  These include file copying
and concatenation, timing routines, and hard copy graphical output.  The
lack of some of these utilities can be circumvented by deleting the
corresponding features from ELLPACK.
5.  ELLPACK  77  PREPROCESSOR.  The  user  defines  a problem  to  be  solved  (including
a fixed  grid  to  be  used)  and  then  specifies  a combination  of  modules  to  be  used 
in  its  solution.  The  basic  elements  in  ELLPACK  are  segments  and  we  list  the 
segment  types  along  with  a brief  indication  of  their  use.  An  example  ELLPACK  77 
program  is  given  at  the  end  of  this  paper. 

Group 1:  Must appear (only once) and be before any from Group 2.

     EQUATION.        Specifies the partial differential equation.
     BOUNDARY.        Specifies the domain and the boundary conditions to be
                      satisfied.
     GRID.            Specifies the rectangular grid in the domain.

Group 2:  May be used as needed and repeated if desired.

     DISCRETIZATION.  Specifies the method to discretize the equations and to
                      generate the linear equations.
     INDEXING.        Places the equations generated by the discretization
                      module in some desired order for further processing.
     SOLUTION.        Actually solves a linear system of equations.
     OUTPUT.          Provides various requested information.

Group 3:  May appear anywhere and as often as desired.

     *                A comment.
     OPTIONS.         Specifies various options selected.
     FORTRAN.         Contains Fortran subprograms used by other segments.

Group 4:  Must appear after all Group 2 segments.

     SEQUENCE.        Specifies the sequence for the Group 2 segments to be used.

Every  ELLPACK  program  ends  with  the  segment  END. 
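The ordering rules for the four segment groups can be expressed as a small checker. The Python sketch below is illustrative only and is not part of the ELLPACK preprocessor: the function name `check_segment_order` is hypothetical, segments are represented by their full key words, and "only once" is read as applying to each Group 1 segment separately.

```python
GROUP1 = {"EQUATION", "BOUNDARY", "GRID"}
GROUP2 = {"DISCRETIZATION", "INDEXING", "SOLUTION", "OUTPUT"}
GROUP3 = {"*", "OPTIONS", "FORTRAN"}
GROUP4 = {"SEQUENCE"}


def check_segment_order(segments):
    """Return a list of rule violations for a list of segment key words."""
    problems = []
    if not segments or segments[-1] != "END":
        problems.append("program must end with the segment END")
        body = list(segments)
    else:
        body = list(segments[:-1])

    for name in sorted(GROUP1):
        if body.count(name) != 1:
            problems.append(name + " must appear exactly once")

    group2_positions = [i for i, seg in enumerate(body) if seg in GROUP2]
    for i, seg in enumerate(body):
        if seg not in GROUP1 | GROUP2 | GROUP3 | GROUP4:
            problems.append("unknown segment " + seg)
        if not group2_positions:
            continue
        if seg in GROUP1 and i > group2_positions[0]:
            problems.append(seg + " must come before any Group 2 segment")
        if seg in GROUP4 and i < group2_positions[-1]:
            problems.append(seg + " must come after all Group 2 segments")
    return problems


if __name__ == "__main__":
    program = ["OPTIONS", "EQUATION", "BOUNDARY", "GRID", "DISCRETIZATION",
               "INDEXING", "SOLUTION", "OUTPUT", "SEQUENCE", "FORTRAN", "END"]
    print(check_segment_order(program))
```

Group 3 segments are accepted anywhere, so the checker places no positional constraint on them.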

Most  of  the  features  of  ELLPACK  77  input  are  illustrated  in  the  example 
below.  Note  that  ELLPACK  77  is  heavily  key-word  oriented.  Standard  naming 
conventions  are  used  such  as  referring  to  the  solution  as  U and  its  various 
derivatives  as  UX,  UYY,  etc.  The  coordinates  are  always  called  X,  Y and  Z.
See  [Rice,  1977]  for  the  detailed  syntax  of  the  ELLPACK  segments.  See  [Rice, 
1977a]  for  further  discussion  of  ELLPACK. 

This  work  is  supported  in  part  by  a grant  from  the  National  Science 
Foundation. 


ELLPACK77  EXAMPLE  - MARCH  18,  1977


DEBUG  = 2 


* 

* 

OPTIONS. 

* 

EQUATION.  2 DIMENSIONS 

UXX*  ♦ 6 , U Y Y S ->**?  UXS  + 1 ,/(XfDUB9(Y)  ) U = COS(Y*X) 

* 

BOUND.  X = 0.0  , U = 0.0 

Y = 1.0  . UY  r X 

X = EXP(  1 . ) , I)  = FXPCY) 

Y = 0.0  , MIXED  = (l.?)U  (COS(X+.?) )UY  = EXP(X) 


* 

GRID. 


UNIFORM  X r 7 

NGRIOY  = 11,  0.,  0.06,  0.16,  0.?9,  O.'iS,  0.06, 

iaS,  0.9?5,  1.0 


DISCRETIZATION 

. 5-POINT  STAR 

INDEX! 1) . 

INDt  X ( 2 ) , 

NATURAL 

RED-BLACK 

SOL. 

SOR  1 

OUTPUT (AA) . 

OUT  ( 1) ) , 

OUTPUT (99)  . 

PLOT-TRUE 

PLOT-EPPOR 

T ABL  E ( 5 , 6 ) - U 

S PLOT-DOMAIN 

1 MAX-ERROR 

OPTIONS. 

TIME  S memory 

SEQUENCE.  OUTPUT ( AA ) % DIS  * 

FORTRAN. 

FUNCTION  TRUF ( X , Y ) 

INDEX!  1)  i 
INDEX!?)  i 
OUTPUT ( 99 ) 

SOLUTION 

SOLUTION 

TMUr  = FXP(X+Y)/( 1 . +DUB9 ( Y ) 1**2 

HL  T URN 

END 

FUNCTION  DUR9CT) 

PUB9  = T*(T+,5)*EXP(T/(1,6T*C0S(T))) 

RF  TURN 

END 


FND. 


0.B7S 
ABSTRACT

A Continuum  Mechanics  Center  has  been  established  for  the  purposes 
of  evaluating  and  developing  models  of  interacting  continua.  Because 
of  the  large  and  growing  body  of  literature  concerning  such  models  and 
related  computer  codes,  the  vast  number  of  assumptions  made  in  their 
use, and the varying types of numerical methods utilized in these codes, a
data base analysis system, CREATABASE,¹ was used to store information and
characteristics of the different codes.

This paper briefly describes CREATABASE, delineates the data base,
describes queries made on the data base and outlines future uses and expansion
of the data base and the data base analysis system.


1 Daniel  Analytical  Services  Corporation,  "User  Reference  Manual  for  The 
CREATABASE  Module  of  an  Integrated  Data  Base  Analysis  System:  Level  U-4A," 
Houston,  TX,  August  1976. 
I.  INTRODUCTION


The  formation  of  a Continuum  Mechanics  Center  (CMC)  at  the 
Ballistic  Research  Laboratory  (BRL)  to  study,  evaluate,  and  develop  large 
hydrodynamics,  solid  mechanics,  particle  transport,  and  heat  transfer 
computer  codes  presented  an  excellent  opportunity  to  simultaneously 
generate  a data  base  containing  information  on  systems  of  partial
differential  equations  and  their  solutions. 

A questionnaire  (Appendix  I)  was  developed  and  sent  to  a number  of 
BRL  scientists  soliciting  information  regarding  codes  of  interest.  The 
response  furnished  data  on  20  codes  and  led  to  the  formation  of  a data 
base  from  which  significant  information  can  be  derived. 

CREATABASE, a commercial data base analysis system marketed by
Daniel Analytical Services Corporation (DANAYLT) of Houston, Texas, was
used to store data for retrieval.  CREATABASE is a relational²,³ data
base system, written in FORTRAN, which runs on the UNIVAC 1100 series
computers.

Queries are made using English-like statements and may be made in a
batch or interactive processing mode.  The output for each mode is slightly
different.  CREATABASE affords little in the way of report generation; that
is, formatted output.  CREATABASE does, however, offer the user the
capability of outputting all or any part of the data on an auxiliary file which
the user can then process in any fashion desired, including report formats.

Sixty-two descriptors form the total domain of the current data base.
It offers the user an accessible and easily used tool for ascertaining
characteristics and capabilities of certain computer codes at BRL.
Information such as the code's applications, numerical method, spatial
geometry, equation(s) of state and reports dealing with the code and its
performance, as well as 32 other items, are included in the data base.
However, specific data about solutions of equations such as subroutine names
in which various processes occur, the actual equations solved, or anomalies
of systems of equations in a particular code do not now constitute a part
of the data base.  Although this information should be available in a
user's manual, a more accessible information source is desirable.  As the
data base develops, such items will be considered as possibilities for
inclusion.


2 Codd,  E.F.,  "A  Relational  Model  of  Data  for  Large  Shared  Data  Banks," 
Communications  of  the  Association  for  Computing  Machinery,  13,  No.  6, 
June  1970. 

3 Date, C.J., An Introduction to Data Base Systems, Addison-Wesley,
NY, 1976.


Adding  new  information  on  codes  already  in  the  data  base  and 
cataloging  other  codes  are  a continuing  part  of  the  CMC's  activities. 
In  addition,  new  commands  to  allow  easier  querying  are  being  added  to 
the  CREATABASE  system. 


II.  THE  CREATABASE  SYSTEM 

CREATABASE  is  a relational  data  base  analysis  system;  that  is,  all 
of  the  data  which  forms  the  data  base  exists  in  tabular  form.  The 
columns  are  formally  called  descriptors  (domains  or  sets  in  relational 
terms)  and  contain  all  of  the  states  which  comprise  that  descriptor. 

For  example,  a descriptor  might  be  MAXIMUM  SPATIAL  DIMENSIONALITY  and 
contain  as  states  ONE  DIMENSIONAL,  TWO  DIMENSIONAL,  AND  THREE  DIMENSIONAL. 
Rows  are  formally  called  records  (n-tuples  or  relations)  and  are  formed 
by  selecting  one  of  the  possible  states  from  each  descriptor.  See 
Figure 1 as an example of an input record (note that two successive commas
indicate there is no data for that entry).
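The table structure described above can be sketched in Python. This is only an illustration of the descriptor/state/record idea and not CREATABASE syntax: the descriptor CODE NAME, the sample record, and the function `parse_record` are hypothetical, while MAXIMUM SPATIAL DIMENSIONALITY and its three states are taken from the text. The sketch assumes a comma-separated input record in which an empty field (two successive commas) means no data.

```python
# Descriptors (columns) in input order. A value of None means the descriptor
# is open ended; otherwise it lists the permitted states.
DESCRIPTORS = [
    ("CODE NAME", None),  # hypothetical open-ended descriptor
    ("MAXIMUM SPATIAL DIMENSIONALITY",
     ["ONE DIMENSIONAL", "TWO DIMENSIONAL", "THREE DIMENSIONAL"]),
    ("NUMERICAL METHOD", None),  # hypothetical open-ended descriptor
]


def parse_record(line):
    """Turn one comma-separated input line into a record (a dict).

    An empty field, i.e. two successive commas, is stored as None.
    """
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != len(DESCRIPTORS):
        raise ValueError("expected %d fields, got %d"
                         % (len(DESCRIPTORS), len(fields)))
    record = {}
    for (name, states), field in zip(DESCRIPTORS, fields):
        value = field or None
        if value is not None and states is not None and value not in states:
            raise ValueError("%r is not a state of %s" % (value, name))
        record[name] = value
    return record


if __name__ == "__main__":
    print(parse_record("SAMPLE CODE,,FINITE DIFFERENCE"))
```

Each record selects at most one state per descriptor, which mirrors the row-and-column picture given above.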

CREATABASE  is  a compiler,  written  in  FORTRAN,  that  takes  statements 
written  in  an  English-like  language,  interprets  them  and  executes  them. 

The  program  is  very  compact  requiring  only  21,000  words  of  storage,  yet 
is  very  modular  consisting  of  42  subroutines. 

There are 56 commands in CREATABASE which fall into seven command
categories (see Figure 2).  Since an explanation for each command appears
elsewhere (see Reference 1), the commands will not be discussed in great
detail here.  It should be noted, however, that a data base can be
created and queried with as few as four commands.

CREATABASE does not have an extensive report generation capability.
It does indicate how many hits or matches have occurred and what
percentage of the data base the number of hits represents.  This statistical
information can be used for designing further queries and to check the
validity of the data base itself.

In addition, CREATABASE does allow any or all of the data in the data
base to be output onto a file for further processing during that execution
or at a later time.  This selective retrieval of data for future use is a
most useful tool for scientific processing.  Several independent programs
exist to assist the user in unformatting data for his special applications.
The user then has great control over the subsequent handling of his data
in addition to the capabilities provided by the system itself.  The user
may interface his data with graphics, simulation, statistical, or report
generation packages.  The user may also interface CREATABASE with other
data base management packages; for example, using CREATABASE for the
purpose of collecting and refining data and the other packages for elegant
output forms.




Figure  2.  The  CREATABASE  Commands  and  Command  Categories 
The  following  describes  briefly  how  a CREATABASE  data  base  is
assembled  and  queried. 

The user determines his own descriptor names which may be up to
126 characters long.  Descriptors are used to represent both numeric and
alphanumeric data but a descriptor may only represent one type of data.
A numeric descriptor may represent a range of numbers and carry additional
identifying information (a label).  An alphanumeric descriptor may be in
either of two forms; it may represent a series of states all of which are
predetermined (coded descriptors), or it may represent an open ended
series of states (name descriptors).  Coded descriptors are preferable
for alphanumeric data as they require fewer computer storage cells.

Once the descriptors have been defined, data are input using one of
three forms.  The first and most common is card input.  Input in this
fashion is in free format and may be accomplished in a batch or
interactive processing mode.  The second input mode uses a formatted file
whereby the user indicates the number of characters in the input string.
The third input mode uses a CREATABASE file which is in a highly coded,
densely packed format.  A CREATABASE file generally is used when the
user wishes to operate on a subfile which he has previously created
(output) on another execution, or earlier in the current execution.

Once the data has been input it may be examined using general and
printout commands shown in Figure 2.  If there are errors, these may be
corrected using the modification or subset binary file commands.  One
might correct the data base by reentering the entire data base as well.

After the data are corrected, the data base is ready to be queried.
Queries are accomplished using general, selective retrieval, subset binary
file, and printout commands (see Figure 2).  At the heart of the querying
commands are the two printout commands PRINT and HOW MANY.  Commands from
the other query categories allow manipulations of the data so that the
specific queries may be answered by the PRINT and HOW MANY commands.  In
addition, output may be generated using the subset binary file commands.

CREATABASE  has  full  Boolean  logic  capability  ("and",  "or"  and  "not") 
which  can  be  used  with  the  modification,  selective  retrieval,  subset 
binary  file,  and  printout  commands.  Using  Boolean  logic  one  may  retrieve 
any  datum  contained  in  the  data  base. 

The general form of the output commands is a descriptor list (the
desired output) followed by a Boolean expression.  A Boolean expression
is a concatenation of quantifiers using the Boolean operators "and", "or"
and "not".  A quantifier has the form descriptor name, followed by a
descriptor value (state).
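The query structure just described (a descriptor list followed by a Boolean expression built from quantifiers) can be mimicked in Python. The sketch below is not CREATABASE syntax; the helper names (`quantifier`, `AND`, `OR`, `NOT`, `how_many`, `print_query`) and the sample records are hypothetical and only model the behaviour of the PRINT and HOW MANY commands, including the hit count and the percentage of the data base that the hits represent.

```python
def quantifier(descriptor, state):
    """A quantifier: descriptor name followed by a descriptor value (state)."""
    return lambda record: record.get(descriptor) == state


def AND(*conditions):
    return lambda record: all(c(record) for c in conditions)


def OR(*conditions):
    return lambda record: any(c(record) for c in conditions)


def NOT(condition):
    return lambda record: not condition(record)


def how_many(records, condition):
    """Model of HOW MANY: number of hits and percentage of the data base."""
    hits = sum(1 for record in records if condition(record))
    percent = 100.0 * hits / len(records) if records else 0.0
    return hits, percent


def print_query(records, descriptor_list, condition):
    """Model of PRINT: the listed descriptors for every matching record."""
    return [tuple(record.get(d) for d in descriptor_list)
            for record in records if condition(record)]


# Hypothetical sample records, one per code.
RECORDS = [
    {"CODE NAME": "CODE A", "MAXIMUM SPATIAL DIMENSIONALITY": "TWO DIMENSIONAL",
     "NUMERICAL METHOD": "FINITE DIFFERENCE"},
    {"CODE NAME": "CODE B", "MAXIMUM SPATIAL DIMENSIONALITY": "TWO DIMENSIONAL",
     "NUMERICAL METHOD": "FINITE ELEMENT"},
    {"CODE NAME": "CODE C", "MAXIMUM SPATIAL DIMENSIONALITY": "ONE DIMENSIONAL",
     "NUMERICAL METHOD": "FINITE DIFFERENCE"},
    {"CODE NAME": "CODE D", "MAXIMUM SPATIAL DIMENSIONALITY": "THREE DIMENSIONAL",
     "NUMERICAL METHOD": "FINITE DIFFERENCE"},
]

if __name__ == "__main__":
    two_d = quantifier("MAXIMUM SPATIAL DIMENSIONALITY", "TWO DIMENSIONAL")
    fd = quantifier("NUMERICAL METHOD", "FINITE DIFFERENCE")
    print(how_many(RECORDS, AND(two_d, NOT(fd))))
    print(print_query(RECORDS, ["CODE NAME"], OR(two_d, fd)))
```

Composing conditions as functions keeps each quantifier independent, so any combination of "and", "or" and "not" can be applied to any descriptor.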
Samples  of  CREATABASE  queries  and  the  resulting  output  are  shown
in  Section  V. 

In addition to Reference 1, an annotated guide⁴ and two samples of
CREATABASE runs⁵ are most helpful for using and understanding the
CREATABASE commands and their interactions.  The cited references, while
forming a complete set of CREATABASE system documents, are terse and make
the use of the system seem more complicated than it is.  Of course,
quite complex interactions can be obtained through the use of CREATABASE
and the UNIVAC EXEC VIII operating system.  Such interactions, while
noted, will not be discussed here.


III.  THE  CONTINUUM  MECHANICS  CENTER  DATA  BASE 

Data  for  the  CMC  data  base  was  gathered  using  the  questionnaire 
shown  in  Appendix  I.  Questionnaires  were  sent  to  a number  of  BRL 
scientists  who  supplied  data  which  were  then  used  to  define  the  descriptors 
(state  names)  for  the  data  base.  Sixty-two  descriptors  (see  Figure  3) 
were  used  to  describe  the  data.  The  data  base  was  designed  so  that  each 
record  of  data  provided  information  for  one  code.  The  descriptors  are 
divided  into  several  broad  categories:  those  dealing  with  (i)  the  type  of 
problems  treated  by  the  code;  for  example,  descriptors  8,  9,  10,  15,  16,  17, 
18,  24,  28,  29,  31,  32,  34,  36,  37,  (ii)  the  characteristics  of  the  code; 
for  example,  descriptors  1,  2,  5,  6,  7,  11,  12,  13,  14,  19,  20,  21,  22,  23, 
25,  26,  27,  30,  33,  35,  38,  39,  40,  and  (iii)  people  and  reports  connected 
with the code; for example, descriptors 3, 4, 41-62.

Queries  are  often  initially  made  on  certain  descriptors  to  determine 
which  code(s)  can  perform  a desired  type  of  calculation.  Subsequent 
queries  can  then  be  made  to  obtain  more  detailed  information  concerning 
these  codes.  (For  an  example  set  of  queries,  see  Section  V).  Furthermore, 
other  data  bases  can  be  generated  from  the  current  data  base;  for  example, 
if the data base becomes very large, one consisting only of reports
dealing with the codes may become desirable.

The data base, which is stored in 35,000 words on the UNIVAC 1108
computer, currently contains data for 20 codes (see Appendix II). Although
there are now only 20 records in the data base, significant information
can be extracted (see Section V). Future plans include expanding the data
base to include more codes and more reports. However, this data base will
not become a bibliography for different hydrocodes.


4 Daniel Analytical Services Corporation, "'Primer' for 'The CREATABASE
Module' of An Integrated Data Base Analysis System: Level U-4A,"
Houston, TX, August 1976.

5 Daniel Analytical Services Corporation, "An Illustrative Check Deck for
'The CREATABASE Module' of An Integrated Data Base Analysis System:
Level U-4A," Houston, TX, August 1976.


IV.  USING  THE  DATA  BASE 

The data base is operational on the UNIVAC 1108 computer at
Edgewood. As such, it runs under the EXEC VIII operating system. This
section will provide the user the means to sign onto the computer,
invoke the CREATABASE system, and gain access to the data base. It is
highly recommended that users copy the data base files onto their own
files before using the system. If this is impractical, the user must not
invoke any commands which would modify the data base; that is, the user
must not use any of the definition, modification, or compilation commands.

The following describes BATCH mode operation. To sign onto the
computer, the command (starting in card column 1) is:

@RUN IDENTIFICATION, ACCOUNT NUMBER, CMCLIB, TIME, PAGES OF OUTPUT.

The user must make arrangements for obtaining an account number. CMCLIB
is the project name for the CMC CREATABASE data base. The next instruction
(card column 1) is:

@MISD*CAB.CAB  CMDIC1.D,  CMCMP1.C. 

The  MISD*CAB.CAB  invokes  the  CREATABASE  system;  CMDIC1.D  is  the  file 
containing  the  descriptor  attributes  (logical  unit  9;  see  Reference  1)  and 
CMCMP1.C  is  the  file  containing  the  compressed  data  (logical  unit  12). 

At this time control passes from the UNIVAC EXEC VIII operating system
to CREATABASE. The user is then ready to query the data base using any of
the permissible command categories: general, selective retrieval, subset
binary, or printout. A familiarity with the CREATABASE system is helpful
to minimize the time spent in designing queries and auxiliary output
(using the subset binary file operations). CREATABASE commands are free
form; that is, there are no card column restrictions as to where commands
can be placed. The normal CREATABASE separator is the comma and the normal
command terminator is the asterisk. Not all CREATABASE commands need a
terminator; however, the user is spared from remembering which ones do by
using a terminator on all commands. The user has the option of changing
the separator and terminator if he so desires.

[Figure 3. The Sixty-Two Descriptors Used for the Continuum Mechanics
Center Data Base. The listing is not reproduced; entries include GENERAL
NUMERICAL METHOD, ORDER OF SCHEME, INTERFACE CAPABILITY, REPORT TITLE 1,
REPORT AUTHORS 1, and REPORT TITLE 2.]

When one has finished his CREATABASE operations, control is given
back to the UNIVAC system with the following command (card column 1):

@FIN

This  command  will  provide  time  and  cost  information  to  the  user. 

If the user wishes to query CREATABASE using the interactive mode,
several additional commands are necessary. First, the user must dial up
and be given access to the computer. Next, before using the RUN statement,
the user must identify himself using a site identification. Site
identifications are easily obtained and are well marked on hard-wired
terminals. After the site ID has been entered and the computer has
acknowledged it, the procedure is as described above. At the conclusion of
the terminal execution, after the @FIN command has been issued, the user
must issue an @@TERM and wait for the terminal or modem light to go out.

Additional aids for the user are the commands CNTRL Z, to erase the
last character typed if a mistake was made, and @@X T10, to interrupt
output when a query is producing too much output. Greater knowledge of the
EXEC VIII operating system and CREATABASE enhances the skill of the
user and enables him to do more complicated operations. However, the
information presented here is sufficient to query the data base.
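
The batch procedure described in this section can be summarized as a
small deck generator. The sketch below is written in Python purely for
illustration. The function name build_batch_deck, its parameters, and the
sample run identification and account number are hypothetical; only the
order of the cards (the RUN card, the MISD*CAB.CAB call, the
asterisk-terminated commands, and @FIN) is taken from the text, and the
exact punctuation of the RUN card is an assumption.

```python
def build_batch_deck(run_id, account, time_limit, pages, commands,
                     project="CMCLIB"):
    """Return the card images for one batch CREATABASE run.

    The card order follows Section IV; the argument names and the
    comma-only field separator on the RUN card are illustrative.
    """
    deck = [
        # Sign-on card, starting in card column 1.
        f"@RUN {run_id},{account},{project},{time_limit},{pages}",
        # Invoke CREATABASE with the descriptor and compressed-data files.
        "@MISD*CAB.CAB CMDIC1.D, CMCMP1.C.",
    ]
    for command in commands:
        text = command.strip()
        # A terminator is accepted on every command, so always add one.
        if not text.endswith("*"):
            text += "*"
        deck.append(text)
    # Return control to EXEC VIII; this also reports time and cost.
    deck.append("@FIN")
    return deck


# Hypothetical run identification and account number.
deck = build_batch_deck("MYRUN", "000000", 5, 50,
                        ["SHORT DICTIONARY STRUCTURE"])
for card in deck:
    print(card)
```

Because every command is given a terminator, the caller does not need to
know which CREATABASE commands strictly require one.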


V.  SAMPLE  QUERIES  AND  OUTPUTS 

This  section  will  show  several  typical  queries  and  the  instructions 
used  prior  to  the  queries  so  that  the  user  has  the  proper  information 
for  querying  at  his  disposal.  A complete  list  of  the  62  descriptors  in 
the  CMC  data  base  can  be  obtained  by  using  the  following  command: 

>SHORT  DICTIONARY  STRUCTURE* 

SHORT  DICTIONARY  STRUCTURE* 

Notice that the command is echoed back to the user, which accounts for
the repeated line of output. The output of this command is given in
Figure 3. The individual states of any descriptor can easily be
determined; for example, the states of the descriptors GENERAL NUMERICAL
METHOD, MAXIMUM SPATIAL DIMENSIONALITY, SPATIAL GEOMETRY and TYPE
OF FLUID FLOW are obtained by the following commands:

> DESCRIPTOR SEQUENCE 12, 15, 16, 32*

DESCRIPTOR SEQUENCE 12, 15, 16, 32*

> DICTIONARY STRUCTURE*

DICTIONARY STRUCTURE*

The DESCRIPTOR SEQUENCE command used in conjunction with the DICTIONARY
STRUCTURE command restricts output of the DICTIONARY STRUCTURE command
to just those descriptors whose sequence numbers appear in the former.
The DICTIONARY STRUCTURE command prints the name and complete
specification of the requested descriptors. The output of these commands
is:

12. GENERAL NUMERICAL METHOD
    OPTION CODE    NUMBER OF CHARACTERS IN LONGEST STATE 32
    CODE  NAME
    1     FINITE DIFFERENCE
    2     FINITE ELEMENT
    3     MONTE CARLO
    4     FINITE DIFFERENCE/FINITE ELEMENT

15. MAXIMUM SPATIAL DIMENSIONALITY
    OPTION CODE    NUMBER OF CHARACTERS IN LONGEST STATE 17
    CODE  NAME
    1     ONE DIMENSIONAL
    2     TWO DIMENSIONAL
    3     THREE DIMENSIONAL

16. SPATIAL GEOMETRY
    OPTION CODE    NUMBER OF CHARACTERS IN LONGEST STATE 33
    CODE  NAME
    1     RECTANGULAR
    2     CYLINDRICAL
    3     SPHERICAL
    4     RECTANGULAR/CYLINDRICAL
    5     RECTANGULAR/SPHERICAL
    6     CYLINDRICAL/SPHERICAL
    7     RECTANGULAR/CYLINDRICAL/SPHERICAL
    8     SPECIAL TREATMENT

32. TYPE OF FLUID FLOW
    OPTION CODE    NUMBER OF CHARACTERS IN LONGEST STATE 23
    CODE  NAME
    1     INVISCID COMPRESSIBLE
    2     VISCID COMPRESSIBLE
    3     INVISCID INCOMPRESSIBLE
    4     VISCID INCOMPRESSIBLE
    5     NONE
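
The structure of this output can be mirrored in a few lines of Python.
The sketch below is illustrative only and is not part of CREATABASE; the
names DESCRIPTORS, state_code and longest_state are hypothetical. It
assumes that NUMBER OF CHARACTERS IN LONGEST STATE is simply the length
of the longest state name, blanks and slashes included, an interpretation
that agrees with all four values printed above.

```python
# State tables copied from the DICTIONARY STRUCTURE output, keyed by
# descriptor sequence number.
DESCRIPTORS = {
    12: ("GENERAL NUMERICAL METHOD",
         ["FINITE DIFFERENCE", "FINITE ELEMENT", "MONTE CARLO",
          "FINITE DIFFERENCE/FINITE ELEMENT"]),
    15: ("MAXIMUM SPATIAL DIMENSIONALITY",
         ["ONE DIMENSIONAL", "TWO DIMENSIONAL", "THREE DIMENSIONAL"]),
    16: ("SPATIAL GEOMETRY",
         ["RECTANGULAR", "CYLINDRICAL", "SPHERICAL",
          "RECTANGULAR/CYLINDRICAL", "RECTANGULAR/SPHERICAL",
          "CYLINDRICAL/SPHERICAL", "RECTANGULAR/CYLINDRICAL/SPHERICAL",
          "SPECIAL TREATMENT"]),
    32: ("TYPE OF FLUID FLOW",
         ["INVISCID COMPRESSIBLE", "VISCID COMPRESSIBLE",
          "INVISCID INCOMPRESSIBLE", "VISCID INCOMPRESSIBLE", "NONE"]),
}


def state_code(sequence, state_name):
    """Return the option code (counted from 1) of a state name."""
    _, states = DESCRIPTORS[sequence]
    return states.index(state_name) + 1


def longest_state(sequence):
    """Return the character count of the longest state name."""
    _, states = DESCRIPTORS[sequence]
    return max(len(state) for state in states)


for sequence, (name, _) in DESCRIPTORS.items():
    print(sequence, name, longest_state(sequence))
```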

The user can now make an intelligent query as to the number of codes in
the data base which use a finite difference method and which calculate
two dimensional cylindrical inviscid compressible flows.


> HOW MANY HAVE GENERAL NUMERICAL METHOD, FINITE DIFFERENCE AND
HOW MANY HAVE GENERAL NUMERICAL METHOD, FINITE DIFFERENCE AND
> MAXIMUM SPATIAL DIMENSIONALITY, TWO DIMENSIONAL AND
MAXIMUM SPATIAL DIMENSIONALITY, TWO DIMENSIONAL AND
> TYPE OF FLUID FLOW, INVISCID COMPRESSIBLE AND
TYPE OF FLUID FLOW, INVISCID COMPRESSIBLE AND
> SPATIAL GEOMETRY, CYLINDRICAL*
SPATIAL GEOMETRY, CYLINDRICAL*


The  response  for  this  query  is: 

ISOLATIONS  TOTAL  PERCENTAGE 

2 20  10.00 

Notice that the number of hits, the total number of data base items,
and the percentage of hits to total items are always displayed.
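
The logic of such a HOW MANY query, a Boolean AND over descriptor states
followed by the three summary figures, can be sketched as follows. This
is an illustration in Python and not the CREATABASE implementation. The
function how_many and the record layout are hypothetical, and the 18
filler records are invented solely so that the toy data base holds 20
records; only BLAST and LASXPT are known from the text to satisfy the
query.

```python
def how_many(records, **wanted):
    """Count the records that satisfy every condition (Boolean AND)."""
    hits = [rec for rec in records
            if all(rec.get(key) == value for key, value in wanted.items())]
    total = len(records)
    percentage = 100.0 * len(hits) / total if total else 0.0
    return hits, total, percentage


# The four descriptor states named in the query above.
query = {"method": "FINITE DIFFERENCE", "dimension": "TWO DIMENSIONAL",
         "flow": "INVISCID COMPRESSIBLE", "geometry": "CYLINDRICAL"}

# Two matching records plus 18 hypothetical non-matching fillers.
records = [dict(query, name="BLAST"), dict(query, name="LASXPT")]
records += [dict(query, name=f"FILLER{i:02d}", geometry="RECTANGULAR")
            for i in range(18)]

hits, total, percentage = how_many(records, **query)
print("ISOLATIONS TOTAL PERCENTAGE")
print(f"{len(hits)} {total} {percentage:.2f}")  # prints: 2 20 10.00
```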


Now  wishing  to  see  the  code  names  for  the  two  codes  satisfying  the 
above  query  we  ask: 

>PRINT  CODE  NAME  FOR  WITH  HOLD* 

PRINT  CODE  NAME  FOR  WITH  HOLD* 


The  HOLD  instruction  is  used  so  that  the  long  Boolean  expression  need 
not  be  repeated.  The  result  of  this  query  is: 

BLAST 

LASXPT 

ISOLATIONS  TOTAL  PERCENTAGE 

2 20  10.00 

Finally, wishing to see more information about the codes BLAST and
LASXPT, we issue the following sequence of commands:

>INDENT 0*

INDENT 0*

>PRINT 1,2,3,4,5,7,8,20,24,25,34,41,42,43,44 FOR WITH 1, BLAST OR
PRINT 1,2,3,4,5,7,8,20,24,25,34,41,42,43,44 FOR WITH 1, BLAST OR
>1, LASXPT*
1, LASXPT*

The  INDENT  0 command  instructs  the  system  to  indent  zero  spaces
(no  indentation)  between  outputs.  The  PRINT  command  illustrates  that 
the  numeric  value  of  a descriptor  may  be  used  in  place  of  its  name.  The 
results  of  these  instructions  are: 

BLAST

UNSTEADY/2D/EULERIAN/SINGLE MATERIAL/FINITE DIFFERENCE/COMPRESSIBLE FLUID
CALCULATION OF MUZZLE BLAST FLOW FIELDS/TP-4155 PICATINNY ARSENAL/AD881
523/DEC 1970
T D TAYLOR
FORTRAN

OPERATIONAL/EASY TO RUN
MUZZLE BLAST CALCULATIONS
PERFECT GAS LAW
REFLECTIVE/FREE-SLIP
70 BY 70
NONE

C K ZOLTANI
1

EVALUATION OF THE COMPUTER CODE BLAST DORF HELP AND HEMP FOR SUITABILITY
OF UNDEREXPANDED JET FLOW CALCULATION BRL1659
C K ZOLTANI
LASXPT

NONEQUILIBRIUM/RADIATION-HYDRODYNAMICS/ATMOSPHERIC TRANSPORT AND RESPONSE/
PLASMA CHEMISTRY/LASER PLASMA/LASER TARGET

THE BRL NONEQUILIBRIUM LASER PLASMA-TARGET INTERACTION CODE/BRL DRAFT REPORT

JOSEPH LACETERA

FORTRAN

OPERATIONAL/COMPLICATED TO RUN
LASER PLASMA INTERACTIONS
PERFECT GAS LAW
TRANSMITTIVE/MOVING
50 BY 5

NONEQUILIBRIUM

JOSEPH LACETERA/CONTINUUM MECHANICS CENTER/BRL/APG MD/301 278 4353
2

LASXPT 1 PLASMA INTERACTIONS/BRL DRAFT REPORT
JOSEPH LACETERA


ISOLATIONS TOTAL PERCENTAGE

2 20 10.00


These  examples  have  been  constructed  as  an  illustrative  group  and  are 
not  meant  to  be  complete  or  portray  all  of  the  capabilities  of  the  system. 
For  example,  the  reverse  numerical  and  alphabetical  sorting  capabilities 
have  not  been  shown. 


VI.  DISCUSSION

This  report  deals  with  the  CMC's  computer  code  data  base  which  uses 
CREATABASE,  a relational  data  base  analysis  system.  Besides  describing 
the  data  base,  the  manner  in  which  it  is  accessed  and  queried  is  also 
explained  and  corresponding  examples  are  given. 

CMC  personnel  are  not  only  the  designers  of  the  data  base  but  also 
are  its  primary  users.  Care  and  maintenance  of  the  data  base  is  one  of 
the  CMC's  functions.  In  addition  to  ensuring  correctness  of  the  current 
data,  the  center  will  add  new  data  as  it  becomes  available.  Such  data 
is  not  limited  to  that  defined  by  the  current  62  descriptors  since  new 
descriptors  will  be  added  as  required.  No  attempt  will  be  made  to  make 
the  data  base  a complete  reference  system  for  all  codes.  However,  the 
CMC  will  consider  and  catalog  not  only  the  most  promising,  but  also  the 
most  used  codes. 

In addition to the data base itself, the CREATABASE system must
undergo change and not remain a static, inflexible tool. One area in
which CREATABASE can be improved is that of subtabling. The amalgamation
of like descriptors (for example, authors) into a single descriptor
will allow for easier querying and new relations to be formed. For
instance, the output of a query involving an author may produce his
co-authors for a single reference or all his co-authors for all his
published works. Furthermore, short queries involving a single amalgamated
descriptor are preferable to long descriptor lists or Boolean expressions.
Finally, the amalgamated descriptor will alleviate some of the need for
handling subset binary files through the fortuitous production of
information. Another area in which CREATABASE can be improved involves
limited alphanumeric searching for name descriptors. Such a change would
not only extend the textual capabilities of CREATABASE but also free the
user from entering artificial data for several types of applications and/or
exactly specifying a descriptor state in the Boolean expression. The
form for this extension should also provide for a range of values rather
than a specific state. Other improvements of CREATABASE are possible;
however, the two items listed above will make this system even better.

Finally,  the  use  of  this  data  base  is  encouraged  as  an  information 
retrieval  system.  Furthermore,  comments  and  suggestions  on  its  structure 
and  contents  are  welcome. 


APPENDIX I.  QUESTIONNAIRE USED TO GATHER DATA BASE DATA


QUESTIONNAIRE


1. The name of the code (acronym plus its meaning) is ________

2. List the following information on the user's manual:

   Author(s) ________
   Address of Authors ________
   Report Number ________
   Date of Publication ________

3. Code is operational on the following computers:

   CDC 7600   UNIVAC 1108   BRLESC

4. The following people are knowledgeable in the code's use: ________

5. Primary application of the code is ________
   Second application of the code is ________
   Tertiary application of the code is ________

6. The type of mesh used is:

   Lagrangian   Eulerian & Lagrangian

7. The code uses the following general numerical method:

   finite difference   finite element   Monte Carlo

8. The code uses the following particular numerical method:

   characteristics   Lax-Wendroff   random walk   Galerkin   multipass
   integral

9. The order of the numerical scheme is ________   not applicable

10. The code performs unsteady calculation:

    yes   no

11. The code can treat the following spatial geometry(ies):

    rectangular   cylindrical   spherical

12. The code can treat the following spatial dimensionality:

    one   two   three

13. List the variables computed directly from the transport equations.

14. The code can use the following equations of state:

    Tillotson   perfect gas law   BRLGRAY   JWL   CHARTD   PUFF

15. The code can apply the following types of boundary conditions:

    reflective   transmittive   non-slip   free-slip   moving   free surface

16. The maximum grid size for the code is ____ by ____.

17. The code has the following type of rezoning capability:

    automatic   manual   none   unknown

18. The code does the following type of radiation transport:

    equilibrium   non-equilibrium   none   unknown

19. The code does the following type of energy deposition:

    time-independent   time-dependent   none   unknown

20. The code does the following type of chemical reactions:

    equilibrium   non-equilibrium   none   unknown

21. The code does the following type of atomic reactions:

    equilibrium   non-equilibrium   none   unknown

22. The code calculates material response:

    yes   no

23. The code treats solids as an elastic plastic:

    yes   no   not applicable   unknown

24. The code has an interface capability:

    yes   no   not applicable   unknown

25. The code can handle ____ (number of) different materials.

26. The code treats the following types of fluid flow:

    inviscid compressible   viscous compressible   inviscid incompressible
    viscous incompressible   none   unknown

27. The code treats shocks by the following method:

    artificial viscosity   shock fitting ________
    none   not applicable

28. The code solves the following equations:

    conservation of mass   conservation of momentum   conservation of energy
    Boltzmann's equation

29. The code is written in the following computer language(s):

    FORTRAN   ALGOL   APL

30. The code has the following special features:

    strength option   tracer particles   combustion option   sliplines

31. The following reports contain information relating to the code itself or
    the code's performance. For such reports give author(s), report number(s),
    and title(s) or key words. The total description per report should be less
    than 120 characters including blanks.


1. 


2. 


5. 


9. 

Describe the salient features of the code in less than 120 characters; for
example,

HELP: unsteady, 2D, Eulerian, multi-material, finite difference, integral
formulation, solid and compressible fluid applications.


Please list any pertinent computer code properties omitted


and any other comments.


APPENDIX II.  THE DATA BASE


[Printout of the data base records. Only the following fragments of the
listing are recoverable.]

FINITE DIFFERENCE/LARGE ELASTIC-PLASTIC TRANSIENT DEFORMATION

HYPERBOLIC EQS USING LEAST SQUARES FORMULATION/INVISCID FLOWS

NEAR STATIC ANALYSIS/JUN 1976

TIME-DEPENDENT/MONTE CARLO/NEUTRON AND GAMMA TRANSPORT/FORWARD AND ADJOINT/
POINT AND VOLUME DETECTORS
SAM-CE A THREE-DIMENSIONAL MONTE CARLO CODE FOR SOLUTION OF FORWARD NEUTRON/
FORWARD AND ADJOINT GAMMA/DNA2830F
FORTRAN
CDC 7600 HUNTSVILLE
OPERATIONAL/COMPLICATED TO RUN

KIRCHHOFF SHELLS/DYNAMIC LOADING

UNSTEADY/2D/EULERIAN/ONE MATERIAL/FINITE DIFFERENCE/INVISCID COMPRESSIBLE FLOW
RIPPLE-A 2D UNSTEADY EULERIAN HYDRODYNAMIC CODE-USERS MANUAL BRL REPORT 1652
FEB 73
FORTRAN

A PARALLEL  ARRAY  COMPUTER  FOR  THE 
SOLUTION  OF  FIELD  PROBLEMS 

W.R.  Cyre*,  C.J.  Davis*,  A. A.  Frank*, 

L.  Jedynak*,  M.J.  Redmond*,  and  V.C.  Rideout* 


I.  INTRODUCTION


The  von  Neumann  single-processor  digital  computer  [1] 
has  dominated  the  computer  scene  for  many  years,  with 
improvements  in  speed  and  memory  capability  continuing 
to  appear,  accompanied  by  reductions  in  size  and  cost. 
However,  the  speed  possible  in  such  machines  is  bounded 
by  limits  on  the  speed  of  signal  transmission,  and  even 
the  fastest  and  most  powerful  single  processor  or  serial 
machines  have  been  found  to  have  severe  limitations  when 
applied  to  the  simulation  of  large  dynamic  problems  such 
as  the  earth's  weather  and  climate  [2,3,4].  The  adoption 
of  a limited  amount  of  parallelism  in  machine  architecture 
has  not  overcome  these  difficulties,  and  in  some  cases 
has  led  to  an  increase  in  software  problems.  Furthermore, 
attempts  made  to  replace  the  now  expensive  analog-hybrid 
machines  by  serial  digital  computers  for  simulation  of 
large dynamic systems have revealed that serious inadequacies
in speed and in user-machine interaction remain.

A new parallel array computer is proposed in this
paper, of a design based on the needs of those concerned
with large scale simulation. This design uses the small
and low-cost but powerful microprocessors and memory
chips which are now available. The array computer can
be configured to have some degree of isomorphism with the
physical problem being solved, as in the parallel analog
computer. It is felt that such a computer using a large
number (N) of such microcomputers can be designed to
keep software simple, while approaching a speed advantage
of N times over the single processor machine [22].

II.  APPLICATIONS  OF  PARALLEL  ARRAY  COMPUTATION 

Array  computers  have  been  classified  by  Flynn  [5]  as 
being  either  SIMD  (single-instruction  stream,  multiple-data 
stream)  or  MIMD  (multiple-instruction  stream,  multiple-data 
stream)  machines.  The  best-known  machine  of  the  SIMD 
type  is  the  ILLIAC  IV  machine  [6,7,8].  This  machine  was 
based on a previously proposed machine, the SOLOMON
computer [9], which was conceived with the weather simulation
problem in mind [10].

The ILLIAC IV and most other parallel array machines
appearing in the literature have a limited number of
processors, so that to expand apparent array size some
serial-parallel approach to large problems is necessary,
which leads to software complexity [11,12,13]. The advent
of the microprocessor and IC memory technologies makes it
possible to propose a more versatile MIMD type of parallel
array machine with enough microcomputers to assign one to
each node of a discretized partial differential equation
simulation (or one to each state variable of a system
simulation) in most problems in the classes of problems
for which the machine was built. Where absolutely
necessary, more than one equation could be assigned to any
node, with some degradation in performance. Such a machine,
the Wisconsin Parallel Array Computer (WISPAC), is
proposed, which is capable of being extended to a
three-dimensional array of as many as 100x100x20 or 200,000
processors, each connected to its six nearest neighbors,
together with a scheme (see Sec. III) for fast
"pass-through" interconnection paths among non-adjacent as
well as adjacent processors. This parallel array computer
could be used in the study of a large class of simulation
problems, as listed below:

(1) Finite difference solution of various field problems
described by partial differential equations (PDEs)

Here the more challenging dynamic problems would be
of principal concern, and the basic three-dimensional
array could be software-configured to also accommodate
problems in one or two dimensions. In addition to
problems of weather and climate simulation, referred
to above, geophysical, plasma confinement and fluid
dynamics problems could be handled. The considerable
increase in speed possible because of parallelism
should make available some degree of on-line
interactive operation in large PDE problems, as well as
parameter and initial condition estimation.

(2) Solution of PDE problems by finite element techniques

Finite element methods [14] have certain advantages
over finite difference schemes, and have been valuable
in certain static problems in elasticity, particularly
in cases where the boundaries are irregular. The form
ultimately taken by a finite element formulation is
much like that for a finite difference formulation,
so that an array machine may be equally useful if this
method of solution is used.


(3)  Monte  Carlo  solution  of  PDEs 

It has been shown that the conditional probability
function of a Markov process [15] satisfies a
second-order elliptic or parabolic PDE and this has led to
suggestions that this method be used in problems with
difficult boundary shapes. In Monte Carlo solutions
the potential at a point is determined by averaging
the potential encountered at a boundary after a large
number of random walks started at the point and
terminating at the first boundary encountered. Serial digital
methods for solution by this approach have not been
satisfactory, because of the time needed to make 1000
or more runs for each point in the discretized field,
and attempts to use the hybrid computer [16] also have
not led to success. With a parallel array computer,
runs could be made simultaneously from all M points
of interest in the field, and a speed advantage over
the serial computer of M times might be approached.

(4) Continuous System Simulation

Continuous dynamic systems, ordinarily described by
sets of first-order ordinary differential equations
in the state variables, have often been studied using
analog-hybrid computers, particularly in the aerospace
industry. As in the case of PDE boundary value
problems, the parallel nature of these problems, as well
as requirements on speed of solution, make the parallel
array computer a potentially useful tool for
simulation studies.

The serial digital computer is now being used for
simulation of dynamic systems, sometimes with programming
aids such as CSMP (Continuous System Modeling Program)
[17,18]. Although increased speeds have been provided
by serial machines, the sizes of problems of interest
have grown even faster. Particularly in the study
of socio-economic systems, electric power systems,
ecologic-environmental systems and physiological
systems, the size of the system makes the on-line
operation of dynamic simulations both slow and
expensive. Without the possibility of user interaction,
simulation tends to be so slow that it is often much
less useful.

A parallel array computer of the kind proposed and
consisting of 100 to 200 microcomputers would be
adequate for most system simulations now being studied
by CSSL or hybrid computer simulation techniques (as
well as small PDE-described systems). Solution of the
equation for each state variable would be assigned to
one microcomputer in the array, with the interconnection
set up by software, as described in Section IV, below.

As  in  the  case  of  PDEs,  speed  requirements  are  greatly 
increased  if  the  needs  of  optimization  or  parameter 
estimation  make  repeated  simulation  runs  a necessity 
[19].  The  speed  advantage  of  a parallel  array  machine 
would  also  be  useful  in  system  problems  involving 
stochastic  signals  or  parameters. 

(5) Discrete System Simulation

Although discrete systems such as those arising in
simulation study of transportation problems, queuing
and cell-growth are better handled by serial machines
than are continuous systems, the increasing size of
these problems may also make for difficulties in the
future. Combined continuous-discrete systems [20]
offer even more of a challenge, especially if combined
with interactive simulation requirements. The
proposed computer is as well-suited to simulation of
discrete and combined systems as it is to the simulation
of continuous systems.

(6)  Data  Manipulation 

The manipulation and reduction of data in such
applications as pattern recognition, radar signal processing,
Fast Fourier Transforms, etc., may also be more
effectively carried out by parallel array processors,
and indeed machines have been devised for such
purposes [21,22]. The proposed array computer can also
be adapted to these applications, which are now being
attempted with simpler array machines.
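
The Monte Carlo procedure of application (3) above can be made concrete
with a short serial sketch. It is written in Python for illustration
only and is not taken from the paper; the function names, the square
grid, and the choice of 2000 runs are assumptions. Each run is a random
walk on the grid that stops at the first boundary node reached, and the
potential at the starting point is estimated as the average of the
boundary potentials encountered. On the proposed array, walks from all M
points of interest would proceed at once rather than one after another.

```python
import random


def walk_to_boundary(start, n, boundary_value, rng):
    """Random-walk from an interior node of an n x n grid to the boundary
    and return the boundary potential found there."""
    i, j = start
    while 0 < i < n - 1 and 0 < j < n - 1:
        di, dj = rng.choice(((1, 0), (-1, 0), (0, 1), (0, -1)))
        i, j = i + di, j + dj
    return boundary_value(i, j)


def monte_carlo_potential(start, n, boundary_value, runs=2000, seed=0):
    """Estimate the potential at `start` by averaging over many walks."""
    rng = random.Random(seed)
    total = sum(walk_to_boundary(start, n, boundary_value, rng)
                for _ in range(runs))
    return total / runs


# Boundary potential equal to the column index; the exact solution of
# Laplace's equation is then u(i, j) = j, so u(5, 5) = 5.
estimate = monte_carlo_potential((5, 5), 11, lambda i, j: float(j))
print(estimate)
```

The estimate carries statistical scatter that shrinks only as the inverse
square root of the number of runs, which is why the text speaks of 1000
or more runs per point.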

III.  HARDWARE FEATURES OF WISPAC

The proposed Wisconsin Parallel Array Computer will
employ a basic two-level architecture which can be
expanded to three levels for larger systems. The individual
node processors which form the computing array are at the
base level of the hierarchy. As previously stated, node
processors are arranged in a three-dimensional cubical
configuration where each node processor (NP) is connected
to its six nearest neighbors. Each NP contains a full
microprocessor CPU with instruction-decode logic,
arithmetic-logic unit (ALU), addressing logic and I/O capability.
Each NP also supports the direct communication linkages
to its six neighbors in the three-dimensional array.
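
Nearest-neighbor linkage is exactly the access pattern of a finite
difference update, in which each node needs only its own value and those
of its immediate neighbors. The following Python sketch illustrates the
work one NP might perform per step; it is an assumption-laden
illustration rather than the WISPAC software. For brevity it uses a
two-dimensional grid with four neighbors instead of six, a steady
Laplace problem, and a Jacobi iteration, none of which is specified in
the text.

```python
def jacobi_sweep(u):
    """Update every interior node from its four neighbors' old values.

    All nodes read the same old grid, so on an array machine each node
    processor could perform its own update at the same time.
    """
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1])
    return new


def solve_laplace(u, sweeps=500):
    """Apply a fixed number of Jacobi sweeps to the grid."""
    for _ in range(sweeps):
        u = jacobi_sweep(u)
    return u


# 5 x 5 grid whose boundary potential equals the column index; the
# interior starts at zero and should relax toward u(i, j) = j.
n = 5
grid = [[float(j) if i in (0, n - 1) or j in (0, n - 1) else 0.0
         for j in range(n)] for i in range(n)]
result = solve_laplace(grid)
print(result[2])
```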

Opposite array edges are linked or wrapped around to
form a hardware wired "hyper-torus". Actual wrap-around
configuration, and in fact all node-interconnections, are
under software control. A block representation of an
NP is shown in Fig. 1. Each node processor supports its
own local data and program storage memory. This differs
from the SOLOMON [9] and ILLIAC [6] type parallel
processors, where each processing element can directly use
local memory only for data storage. Thus, the Wisconsin
Parallel Array Computer has a Multiple Instruction Stream,
Multiple Data Stream, or MIMD [7], architecture, such that
any number of node processors may execute different
instructions simultaneously.
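
The wrapped nearest-neighbor connectivity can be expressed compactly
with modular arithmetic. The Python sketch below is illustrative; the
function name torus_neighbors and the coordinate convention are
hypothetical, and it assumes that wrap-around is enabled on all three
axes, whereas the text leaves the actual configuration under software
control.

```python
def torus_neighbors(node, shape):
    """Return the six nearest neighbors of `node` in a three-dimensional
    array of the given shape, with opposite edges wrapped around."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coords = list(node)
            # Modular arithmetic links each edge to the opposite edge.
            coords[axis] = (coords[axis] + step) % shape[axis]
            neighbors.append(tuple(coords))
    return neighbors


shape = (100, 100, 20)                      # maximum array size
corner = torus_neighbors((0, 0, 0), shape)
print(corner)
print(shape[0] * shape[1] * shape[2])       # 200000 node processors
```

Even a corner node has six distinct neighbors once the edges are wrapped,
so every NP sees the same local connectivity.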

The local memory associated with each node processor
is also integrated into the communication system between
the node processors and the next level of the hierarchy,
the sector control computer. A sector may be defined as
a set of from 5x5x5 to 25x25x20 node processors. A diagram
of the relationship of one node processor to its local
memory and to the sector control computer (SCC) is shown
in Fig. 2. The sector control computer supports its own
data and program memory. However, the local memory of
any node processor can be appended to the end of the SCC's
dedicated memory to allow memory-to-memory transfers
between the two memory blocks. In this manner local
programs may be loaded into the array and I/O may be achieved
during array operation. The memory communication bus can
also be routed for external access to the sector. This
allows multiple sector expansion of the array.

Memory cost will be the largest portion of overall
array cost; therefore, memory word width and size for each
NP must be optimized based on a study of the systems to
be simulated in the computer. Error correcting memory
must be considered for the node processors to ensure
reasonable reliability. Error detection and correction
for the node processor CPUs must also be studied. One
possible method for CPU error detection is for each NP
to support redundant CPUs. For correction, the NPs and
the sector control computer should be software compatible.
When a faulty NP is detected, the SCC can execute the
NP's tasks directly through the appended memory
communication system.

The  array  computer  can  be  expanded  by  adding  array 
sectors. Array continuity is maintained across sector
boundaries by setting up nearest neighbor linkages between
node processors at sector edges. In the present design
of  WISPAC  each  fully  expanded  sector  can  support  up  to 
25x25x20  node  processors.  A full  complement  of  sixteen 
sectors  would  bring  the  maximum  array  size  to  100x100x20. 
When  more  than  one  sector  is  used,  it  is  necessary  to  add 
a master  array  computer  as  a third  level  to  the  hierarchy 
for  control  of  overall  array  utilization.  Under  master 
computer  control,  several  problems  could  simultaneously 
be  run,  each  in  a different  sector,  or  several  in  one 
sector.  (This  function  could  also  be  realized  within  a 
single  sector  under  sector  computer  control.)  The  array 
could  also  be  fully  utilized  by  a single  problem.  An 
overall  block  diagram  is  shown  in  Fig.  3.  (Note  that  the 
node  memory  bus  diagram  is  a simplified  representation  of 
that  shown  in  Fig.  2.) 

If intra-array communication is hardware limited to
nearest neighbor linkages, the classes of problems
that can be easily solved are somewhat restricted. Therefore
a second kind of intra-array communication is proposed
to allow a limited number of high-speed arbitrary path
linkages between any pairs of NPs in the array. The scheme
integrates a commutation capability into the nearest
neighbor linkages. An output linkage on one of the six
node processor sides could be "switched" either to carry
data from the output port of the microprocessor associated
with that side, or data from any one of five input linkages
(the input linkage on the same side as the output linkage
is excluded). Diagrams of some possible configurations
for commutation are shown in Fig. 4 (here two-dimensional
representations are used, for simplicity). A simulation
problem utilizing commutation (in two dimensions) is
diagrammed in Fig. 6 of Section V. Commutation operation
could be controlled by the microprocessor at each node so
that commutation could be changed during program runs.
The sector control computer could also control commutation
globally. With this scheme, a limited number of arbitrary
paths could be constructed between any two node processors.

IV. SOFTWARE CONSIDERATIONS AND IMPLEMENTATION

Given a rectangular three-dimensional array of
independently programmable microprocessors, each connected
to  its  six  nearest  neighbors,  the  programming  and  solution 
of  a field  problem  such  as  the  diffusion  equation  by 
finite  differences  [23,24]  is  straightforward.  If  the 
basic  equation  is 

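$$\frac{\partial \phi}{\partial t} = K \left( \frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} + \frac{\partial^2 \phi}{\partial z^2} \right) \qquad (1)$$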

then  the  "molecule"  of  Fig.  5 gives  the  difference  form: 

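$$\frac{\phi_{O,j+1} - \phi_{O,j}}{\Delta t} = \frac{K}{h^2} \left( \phi_{E,j} + \phi_{W,j} + \phi_{N,j} + \phi_{S,j} + \phi_{U,j} + \phi_{D,j} - 6\phi_{O,j} \right)$$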
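As an illustration only, the explicit difference form can be sketched in a few lines of Python, written serially rather than for the array computer. The function name `explicit_step`, the nested-list grid layout, the fixed boundary values, and the symbols `h` (mesh spacing) and `dt` (time step) are assumptions of this sketch and do not come from the WISPAC design. In the proposed machine each node processor would evaluate only the innermost update for its own point, using the values received over its six nearest neighbor linkages.

```python
def explicit_step(phi, K, h, dt):
    """Advance a 3-D grid by one explicit time step.

    phi is a nested list phi[i][j][k]; boundary values are held fixed
    (an assumption made only to keep the sketch short).
    """
    nx, ny, nz = len(phi), len(phi[0]), len(phi[0][0])
    # Copy the old field so every update uses values from time level j only.
    new = [[[phi[i][j][k] for k in range(nz)] for j in range(ny)]
           for i in range(nx)]
    r = K * dt / (h * h)
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                # Sum over the six nearest neighbors (E, W, N, S, U, D).
                neighbours = (phi[i + 1][j][k] + phi[i - 1][j][k]
                              + phi[i][j + 1][k] + phi[i][j - 1][k]
                              + phi[i][j][k + 1] + phi[i][j][k - 1])
                new[i][j][k] = phi[i][j][k] + r * (neighbours - 6.0 * phi[i][j][k])
    return new
```

Because each new value depends only on old values at the point and its six neighbors, all interior points could be updated at the same time, which is the property the array computer exploits.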
The explicit formulation will give a new value of $\phi$
at each point in one calculation. However, an implicit
formulation, such as the one shown below, may be preferred
in many cases because it permits larger steps in time.


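$$\frac{\phi_{O,j+1} - \phi_{O,j}}{\Delta t} = \frac{K}{2h^2} \left( \phi_{E,j+1} + \phi_{W,j+1} + \phi_{N,j+1} + \phi_{S,j+1} + \phi_{U,j+1} + \phi_{D,j+1} - 6\phi_{O,j+1} \right) + \frac{K}{2h^2} \left( \phi_{E,j} + \phi_{W,j} + \phi_{N,j} + \phi_{S,j} + \phi_{U,j} + \phi_{D,j} - 6\phi_{O,j} \right)$$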


This is the Crank-Nicolson formulation, and requires
some implicit form of solution. Because the matrix is
sparse, iterative methods of solution such as the Jacobi
scheme are used. Such methods may be applied in the parallel
array computer by requiring each microcomputer to
simultaneously calculate an estimate of $\phi_{O,j+1}$ (using
$\phi_{E,j+1}$ estimates and $\phi_{E,j}$, etc.), then exchange
estimates $\phi_{O,j+1}$, etc., and determine new estimates until a
convergence criterion is satisfied. In certain regular problems the
number of iterations for convergence might be determined
by a computation in the master computer; in more complex
cases a convergence criterion might need to be incorporated
in the programs of some or all of the microprocessors.

Hardware may be required to detect overall convergence
in iterative solutions in order to maintain solution speed.
One possible definition of global convergence could be
the logical "AND" of all local convergences.
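The Jacobi procedure and the "AND" convergence test can be sketched as follows. To keep the example short it is reduced to one space dimension (two neighbors instead of six) and is run serially; the function name `crank_nicolson_jacobi`, the tolerance `tol`, the iteration limit and the fixed end values are all assumptions of the sketch. Each pass of the inner loop stands for what every node processor would do simultaneously, and `all(local_ok)` plays the role of the global convergence signal.

```python
def crank_nicolson_jacobi(phi, K, h, dt, tol=1e-10, max_iter=10_000):
    """One Crank-Nicolson time step in 1-D, solved by Jacobi iteration.

    Returns (new_values, iterations_used). End values are held fixed.
    """
    n = len(phi)
    r = K * dt / (2.0 * h * h)
    # Known part of the equation, built from the old time level j.
    rhs = [phi[i] + r * (phi[i + 1] + phi[i - 1] - 2.0 * phi[i])
           for i in range(1, n - 1)]
    est = list(phi)                      # initial estimates for level j+1
    for iteration in range(1, max_iter + 1):
        new = list(est)
        local_ok = []                    # one convergence flag per node
        for i in range(1, n - 1):
            # Jacobi: use only the neighbors' previous estimates.
            new[i] = (rhs[i - 1] + r * (est[i + 1] + est[i - 1])) / (1.0 + 2.0 * r)
            local_ok.append(abs(new[i] - est[i]) < tol)
        est = new
        if all(local_ok):                # logical AND of local convergences
            return est, iteration
    raise RuntimeError("Jacobi iteration did not converge")
```

A Gauss-Seidel variant would use the newest neighbor values as soon as they are computed, which introduces an ordering between nodes; the Jacobi form shown here has no such ordering.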

It should be noted that some of the schemes used
to improve the speed of solution of field problems in
serial digital computers (Gauss-Seidel, Alternating
Direction Methods, etc.) may not be suitable for use in
a parallel array machine, which depends on parallelism
to attain its speed advantages.

As previously stated, the proposed WISPAC is a
Multiple Instruction Stream, Multiple Data Stream (MIMD)
parallel array computer in which the node processors can
be individually programmed as von Neumann machines.
Compilation and assembly of programs will take place at the
sector control or master control computer levels.

Programming of any one node processor through the
sector or master computers is a fairly simple task using
FORTRAN, APL or some other user-oriented language. However,
to program an entire array without requiring program
development for each node requires that procedure-oriented
programming tools be employed.

To implement the solution of a field problem in a
homogeneous medium, a single program describing the
medium at each point could be developed on the master
or sector control computer and then loaded into the NPs
that represent the medium. If mathematical relationships
can be formed for determining position, coefficients
and/or formulation of solutions needed at boundaries, it
would be possible to develop boundary programs by using
a procedure-oriented FORTRAN or APL which could modify
program statements, statement coefficients, and/or loading
position into the array based on known relationships.
If such relationships are difficult to state or do not
exist, each NP or set of NPs at a boundary would have to
be independently programmed. In similar fashion,
programming the array computer for solution of the field in
a non-homogeneous medium might be simplified by using
known relationships.
If the geometrical configuration of the boundaries
of the PDE is such that it fits well into a
rectangular array, or into the cylindrical or toroidal
array obtained by "wrap-around", then the geometrical
interconnection problem is a simple one. If the shape
does not fit well, one of these array forms may still
be satisfactory when used with the aid of the pass-through
or commutation capability. For a computer dedicated
to some single large problem (such as weather or
climate), a special array might be set up.

Other general-purpose array structures might be
preferred for some parallel array computers of the
kind described here. Thus if triangles are to be used
in place of squares, each node processor would need to
have six nearest neighbor communication channels in the
two-dimensional case, and twelve in the three-dimensional
case. Although the number of communication linkages
required would be greater, the solution of some problems,
especially those related to finite elements, would be
simplified.

It should be noted that while special hardware
consisting of extra links to neighboring processors may speed
and  simplify  problem  solution  for  cases  where  triangles 
must  be  used,  software  can  be  used  to  effect  the  same 
information  transfers  for  a cubical  array  by  routing 
through  processors. 

To  solve  simulation  or  modeling  problems,  each  node 
could  be  assigned  to  do  the  calculations  associated 
with  one  state  variable.  In  system  modeling,  a transfer 
function list might be specified along with an
interconnection list. The master or sector control computers
could process the lists, assigning transfer functions to
available nodes and constructing interconnect paths
utilizing the commutation capability described in the
preceding section. System simulation could be handled in
a similar fashion where state variables would be assigned
to available nodes and interconnection paths constructed
by software. The three-dimensional structure of the
WISPAC greatly simplifies interconnection path definition
for two-dimensional simulation and modeling problems.
Using a processor node depth of 20 levels can be compared
to constructing 20-layer printed circuit cards. Simple
path construction is assured in most practical cases.

For the general simulation case, one method of
overall problem control requires that each processor
indicate the end of its computation cycle by setting a flag.
The master would initiate I/O and control instructions
after all node processors have signified that they are
ready for the next iteration. Another more versatile
method is suggested by Miller [22], in which computation
proceeds upon availability of data to the particular node,
i.e., a node "fires" when sufficient data is present.
Advantages of this method are as follows:

1. There is less time spent idle when I/O lists
of different length are present.

2. The speed will not be less than the hardware-
lockstep method, and may often be greater.

3. If software passing of information becomes
necessary, no delay is encountered due to the
lockstep process.

4. With the possibility of data-available computation,
it becomes possible to have more than one user
efficiently occupying space in the array at a
given time (multi-programming as well as
multi-processing).
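The data-available ("firing") rule can be illustrated with a small serial simulation. This is only a sketch under stated assumptions: the function `run_dataflow`, the dictionary layout of nodes and the sweep-based scheduling are invented for the example and are not taken from Miller's method or from the WISPAC control hardware.

```python
def run_dataflow(nodes, sources):
    """Simulate data-available firing.

    nodes:   name -> (function, [names of inputs])
    sources: name -> initial value
    Returns (all computed values, order in which nodes fired).
    """
    values = dict(sources)
    pending = dict(nodes)
    order = []
    while pending:
        # A node is ready to fire once every one of its inputs is present.
        ready = [name for name, (_, inputs) in pending.items()
                 if all(i in values for i in inputs)]
        if not ready:
            raise ValueError("deadlock: some node never receives its data")
        for name in ready:
            func, inputs = pending.pop(name)
            values[name] = func(*(values[i] for i in inputs))
            order.append(name)
    return values, order
```

Nodes with short input lists fire without waiting for nodes with long ones, which is the source of the idle-time advantage in item 1 of the list above.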
Given the ability of each processor to process data
out of step, the setting up of accumulation paths as well
as software passing of information becomes easily possible.
Such paths are also necessary in matrix multiplication
and other applications, to overcome the limitation
presented by the number of inputs available at a particular
node.


In order to perform the previously stated tasks,
various specialized software would have to be developed
to ease user programming through the master or sector
computers. The following system software would be
required:

(1)  Language  Processor 

A high  level  language  is  needed  which  is  capable  of 
handling  files  and  programs  in  a manner  similar  to 
subscripted  variables.  This  type  of  language  is 
necessary  due  to  the  large  number  of  programs  which 
are  manipulated  and  placed  in  the  NPs.  Such  language 
processors  have  been  implemented  on  serial  machines 
in  the  past. 

(2)  Cross  Compiler 

The task of this software entity is to accept
programs in some user-oriented language and translate
them to machine language programs for the node
processors.

(3)  Loader 

The most complicated part of the systems software
for the proposed MIMD configuration will be the loader
routines. The loader will be responsible for:

1. Allocation of solution areas in the array of
processors.

2. Maintenance of available and unavailable processor
lists.

3. Actual depositing of programs into array memories.

Pattern recognition techniques may be necessary to
fit the maximum number of jobs into the solution array.
If the master or sector computer maintains lists of
assigned and unassigned processors, it may place many jobs
into the array at any given time, just as many users are
allocated memory in today's timesharing systems. Even in
the instance of a single-user installation, limited use
of these techniques may be necessary in order to allocate
around areas of hardware failure to provide for graceful
degradation. Part of these responsibilities may be
allocated to the operating system.
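A very simple form of the loader's allocation task is sketched below. It is an assumption-laden illustration: the array is reduced to one two-dimensional plane, the availability list is a grid of booleans (with `False` marking both assigned and failed processors), and the first-fit search in the hypothetical function `allocate_block` stands in for whatever pattern recognition technique an actual loader would use.

```python
def allocate_block(free, rows, cols):
    """First-fit allocation of a rows x cols block of node processors.

    free is a 2-D list of booleans (True = available). On success the block
    is marked as assigned and its top-left (row, col) is returned;
    None is returned when no block of that size is available.
    """
    n_rows, n_cols = len(free), len(free[0])
    for r in range(n_rows - rows + 1):
        for c in range(n_cols - cols + 1):
            if all(free[r + i][c + j]
                   for i in range(rows) for j in range(cols)):
                for i in range(rows):
                    for j in range(cols):
                        free[r + i][c + j] = False   # move to the assigned list
                return (r, c)
    return None
```

Marking a failed processor as unavailable before any job is loaded makes later allocations route around it, which is one way to obtain the graceful degradation mentioned above.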

V.  PROGRAMMING  OF  AN  EXAMPLE 


The simulation study of the dynamics of the
cardiovascular system [26] is often carried out on an analog
or hybrid computer, in order to achieve real-time or
faster operation [27]. If "multiple-modeling" is used
[28], the pressure-flow dynamics may be set up on the
analog part of a hybrid computer, and the slower mass
transport dynamics on the digital part. In Fig. 6(a)
the circuit representation of the discretized fluid-flow
equations [26] is shown for a simple model of the systemic
part of the circulation. Typical equations for pressures,
p, and flows, f (for part of the aorta), are:

$$p_5 - p_6 = R_6 f_6 + L_6 \frac{df_6}{dt}$$
The mass transport model, if a perfect mixing
chamber is assumed corresponding to each compliance, is
shown in Fig. 6(b). Typical equations for concentrations,
Y, and mass flows, f*, are:

$p_6 C_6 + q_{6u}$

where $q_6^*(0)$ is the initial mass in the compartment,
and $q_{6u}$ is the unstressed blood volume in the corresponding
arterial segment.

The  pressure-flow  dynamics  of  Fig.  6(a)  may  be  set 
up  in  a two-dimensional  part  of  an  array  computer  such 
as  the  WISPAC,  with  one  NP  assigned  to  each  state  variable, 
as shown in Fig. 6(c), with help from a strong topological
correspondence.  Note  that  the  NP  which  yields  the  state 
variable  p^  also  gives  f ^ , which  is  not  a state  variable 
since  it  is  given  by  f^  = (p^-p^ ) /R^ . 

Note  too  that  pass-through  is  needed  in  this  case 
to  bring  all  four  branch  connections  to  node  Pg.  If 
more  than  four  branches  join  at  a node  the  two-dimensional 
form  can  only  be  used  if  some  more  involved  pass-through 
techniques  are  used. 

The mass transport model of Fig. 6(b) may be
programmed on the parallel array computer as shown in Fig.
6(d), where a topological resemblance is again seen.
This model requires inputs from the pressure-flow model of
Fig. 6(c), and in a three-dimensional computer could
lie in a plane directly above or below it so that
interconnections will be simple. In modeling of the transport
of more than one kind of material, models of the form
of Fig. 6(d) may be stacked in parallel planes.

With proper software development the interconnection
and data transfer routines could be pre-determined by a
program in the master computer, as suggested in the
preceding section.

VI.  CONCLUSIONS 

The proposed computer, both as regards hardware and
software, does not require any particularly new or
difficult development. Error correction, fault tolerance
and fault location will offer problems which must receive
much attention, particularly if any large array machine of
this type is built.

Initial  studies  will  be  made  on  a 3-microprocessor 
(plus  master  computer)  machine  now  being  built. 
Fig. 1 Node processor diagram showing nearest six
neighbor linkages.

Fig. 2 Relationship of one node processor (NP) to the
sector control computer (SCC) showing (1) the
sector memory communication bus, (2) the sector
control bus, (3) the sector routing control
lines and (4) the external sector access bus.

Fig. 3 Array Processor block diagram showing control
computer interconnections.

Fig. 4 Types of commutation operations.

Fig. 6 (a) Circuit representation of a discretized model
of the systemic part of the blood circulation
system (pressure-flow dynamics).

(b) Compartment representation of a mass-transport
model corresponding to and driven by the
model in (a).

(c) WISPAC 2-dimensional routing program for the
pressure-flow model of (a).

(d) WISPAC routing program for the mass-transport
model of (b), arranged to lie in a plane directly
above or below the set-up in (c).


SOFTWARE FOR INTERVAL ARITHMETIC:
A REASONABLY PORTABLE PACKAGE


J. M. YOHE


1. Introduction: One means of bounding the error in digital computation
is through the use of interval, or range, arithmetic; instead of computing
with approximate real numbers, one calculates with pairs of approximate
real numbers, the first member of a pair being a lower bound for the
true result, and the second an upper bound. By this method, one can take
into account such varied sources of error as uncertainty in input data,
inaccuracies in mathematical formulae, and errors in approximation of real
numbers and the operations on them. The theory of interval arithmetic is
developed extensively elsewhere [5]; we shall not treat it here.
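For readers who want a concrete picture of computing with pairs of bounds, the following Python sketch shows interval addition and multiplication. It is not the package described in this paper: the class name `Interval` and the helper names are invented, and the bounds are made safe by moving each computed endpoint one representable number outward with `math.nextafter` (Python 3.9 or later), which is an assumption of the sketch and is somewhat wider than directed rounding in hardware would give.

```python
import math


def _down(x):
    """Next representable float below x (outward rounding of a lower bound)."""
    return math.nextafter(x, -math.inf)


def _up(x):
    """Next representable float above x (outward rounding of an upper bound)."""
    return math.nextafter(x, math.inf)


class Interval:
    """A closed interval [lo, hi] with lo <= hi."""

    def __init__(self, lo, hi):
        if lo > hi:
            raise ValueError("lower bound exceeds upper bound")
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(_down(self.lo + other.lo), _up(self.hi + other.hi))

    def __mul__(self, other):
        # The extreme products always occur at endpoint combinations.
        products = [a * b for a in (self.lo, self.hi)
                    for b in (other.lo, other.hi)]
        return Interval(_down(min(products)), _up(max(products)))

    def contains(self, x):
        return self.lo <= x <= self.hi
```

For example, `Interval(1.0, 2.0) * Interval(-3.0, 0.5)` yields an interval slightly wider than [-6, 1], so the true product of any two numbers drawn from the operands is guaranteed to lie inside it.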

The major obstacle to the use of interval arithmetic is the
unavailability of software. INTERVAL is not a standard data type in any
production language that we know of; preparation of a package of subprograms to
handle interval data is a nontrivial task. Since the representation of and
operations on interval data are necessarily rather heavily dependent upon
the architecture of the host computer, a package developed for one system
can not, in general, be moved intact to a different system.

This paper describes an interval arithmetic package for use with
FORTRAN. The power of the AUGMENT precompiler [2,3] is employed to render
the major part of the package independent of specific data representations,
and the package is so designed that the parts which are
representation-dependent are concentrated in a relatively small number of
modules, most of which are easily adapted to new environments.

Although this package was written for the UNIVAC 1110, it has also
been implemented on IBM, DEC, Honeywell, and CDC equipment. No major
problems have been reported in transporting the package to these host
systems.

The information given in this paper is not intended to be exhaustive.
The interested reader will find detailed information on all aspects of
the package in the technical manual [9].


The viewpoint taken was that of the end user. We sought to make
the package complete, accurate, convenient to use, fail-safe, and
transportable.

Completeness: All appropriate ANSI Standard Fortran [1] operations
and functions were implemented, along with some (such as tangent,
hyperbolic sine, and hyperbolic cosine) which are not ANSI Standard but are
normally implemented in the FORTRAN language anyhow. Since interval
numbers can be regarded in a natural sense as belonging to an extension of
the real number system, most arithmetic operations and special functions
are meaningful. In addition, there are a large number of functions peculiar
to interval arithmetic (such as union and intersection of intervals,
midpoint, and half-length) which were also included in the package. Finally,
input/output routines and conversions between intervals and standard data
types (where appropriate) were implemented. A list of the functions and
operations is given in Table 2.1.

Accuracy: It is well known that error is inherent in digital
computations, and that most computer architectures are less than optimal from this
point of view. (Recently, there has been increased interest in developing
more hospitable architecture; see, for example, Lang and Shriver [4]
and Ris [6].) Moreover, it is extremely difficult, if not impossible, to
obtain the information required for rigorous bounding of hardware operations.
Since interval arithmetic tends to be pessimistic anyhow, we felt that the
calculation of bounds through straightforward application of a priori
estimates such as Wilkinson's [7] would lead to intolerable inaccuracy.
In addition, vital information concerning such phenomena as exponent range
faults is generally not available in existing systems. Consequently, the
package was designed to be based on a set of arithmetic primitives of the
type described in [8].

The special functions pose a different problem. A straightforward
application of interval arithmetic to the algorithms used to compute these
functions will yield unacceptably wide intervals, due to the dependency
problem [5]. We addressed this problem by employing higher precision
functions, bounding the results on the basis of accuracy information
provided by the software supplier. This must be regarded as being less than
completely satisfactory, since available error information is often sketchy
and may not be completely rigorous; however, the bounding procedure takes
these disadvantages into account, and the results can be regarded as being
valid with extremely high probability.

Like the arithmetic routines, input/output routines need to be written
from the ground up. Conversion routines supplied with standard FORTRAN
systems have no provisions for obtaining the required bounds; moreover,
most of them are of unknown, if not dubious, accuracy.

Convenience: By itself, no collection of routines to perform
nonstandard arithmetic is really convenient to use. Each operation must be
performed by a call on one of the subprograms in the package; this means
that the user must parse every expression himself and write his program in
what amounts to assembly language. The best that can be done in this
setting is to minimise the inconvenience. To this end, we have kept the
package as internally consistent as possible. All entry points to the
package bear the prefix INT; routines used by the package itself are
prefixed with INT or BPA, according to their level. Thus, by avoiding variable
and subprogram names beginning with these prefixes, the user may be assured
of avoiding conflicts.
Calling sequences for the routines in the package are consistent and
concise. No information is transmitted which is not absolutely essential
to the function being performed. Where the result of a function or operation
is a standard data type, the routine is implemented as a function of that
type; otherwise, the routine is implemented as a subroutine. In the former
case, the arguments are simply the operands; in the latter case, the arguments
are the operands together with the result. (In the interest of flexibility,
the endpoints of an interval are regarded as being a nonstandard data type.)

Convenience of use of any nonstandard data type is increased dramatically
by the use of an appropriate precompiler. This package is specifically
designed to be used with the AUGMENT precompiler, which allows the source
FORTRAN code to be written as though FORTRAN recognized INTERVAL as a
standard data type. In this case, just as above, the user must avoid conflicts
with the package; although the source code will not contain references to
the routines of the package, the output from AUGMENT, of course, will. In
addition, the user must also avoid the function names and operators shown
in the table, since these become reserved words in the extension of FORTRAN.
In most cases, this should not be an onerous task.

Fail-safe: Errors can occur in many of the operations of the interval
package, just as they can in REAL operations. It is our viewpoint that errors
should not be ignored. Each subprogram in which an error can occur will
call the error-handling routine, INTRAP, prior to returning control to the
calling program. If no error has occurred, INTRAP simply returns control
to the routine which called it. Otherwise, INTRAP takes the action specified
by a table which resides in a COMMON block; the response depends on the
error which has occurred, but usually includes a print-out which gives the
user complete information on the error. The user may, if he chooses, alter
the response by changing the table.


Transportability: Transportability and flexibility of representation
are closely linked. The package is based on three data types: BPA (mnemonic
for Best Possible Answer), which is the data type of the endpoints of
intervals, but is otherwise undefined except in a few primitive routines; INTERVAL,
which is defined to be a BPA array of length 2; and EXTENDED, which is the
data type in which evaluations of special functions are performed. In the
UNIVAC version of the package, the representation of BPA is the same as
that of REAL, and EXTENDED is a synonym for DOUBLE PRECISION.

The AUGMENT precompiler is used to extend the representations of these
nonstandard data types throughout the package. The output of the AUGMENT
precompiler is a set of routines which, apart from the arithmetic primitives
which are written in assembly language, conforms as closely as possible to
ANSI Standard FORTRAN.

There are fewer than twenty program modules which depend on the
representations of BPA and EXTENDED numbers; many of these will need no
alteration for most applications. Adaptation of the package to other
hardware is discussed more fully in Section 4.
3. Use of the INTERVAL package:

If used as a collection of subroutines, without the benefit of the
AUGMENT precompiler, the INTERVAL package is, of course, used just as
any package of subprograms would be used. That is, the user must decide
which routines must be invoked and in what order. We prefer to regard the
INTERVAL package as an extension of the capabilities of the host computer
system, and the AUGMENT precompiler as an instrument for extending FORTRAN
to take advantage of the additional power. Thus we will address the
question of use of the package in this context; the user who by reason of
preference or necessity does not use the precompiler will have no difficulty
in adapting this discussion to his needs.

Type declarations for INTERVAL variables: If X, Y, and Z represent
INTERVAL variables, they must be declared as such by the statement

      INTERVAL X, Y, Z

INTERVAL variables may be dimensioned; the only restriction is that if the
FORTRAN compiler limits the number of dimensions of an array, that limit
must be decreased by 1 for INTERVAL variables. The reason for this is
that AUGMENT will declare INTERVAL variables as arrays.

Assignment of values to INTERVAL variables: Most real numbers can not
be represented exactly in the computer. The error inherent in a statement
such as

      X = .1

may not be immediately obvious. If X is an INTERVAL variable, the above
statement will assign a value to X, but that value will not, in general,
be an interval containing the real number .1. In order to set X to an
interval which does contain .1, one may write

      X = '(.1, .1)$', or X = 9H(.1, .1)$ if the host compiler

does not accept quoted Hollerith literals. If the host compiler generates
a sentinel for a Hollerith literal, and if the UNPACK primitive recognizes
that sentinel, the terminal $ may be omitted. Any string that is legal input
for the formatted read (see discussion below and Appendix 1) is also
acceptable to the routine which performs this conversion. Thus, on the UNIVAC
1110, the statement

      X = '.1'

would also have the desired effect.
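The point about .1 can be checked on any binary machine. The sketch below, which is not part of the package, uses Python's exact rational type to compare the stored floating-point value with the decimal it was meant to represent and then builds a tight enclosing pair of bounds; the function name `enclose_decimal` is an assumption of the example, and `math.nextafter` requires Python 3.9 or later.

```python
import math
from fractions import Fraction


def enclose_decimal(text):
    """Return (lo, hi), adjacent or equal floats with lo <= value(text) <= hi."""
    exact = Fraction(text)            # exact rational value of the decimal string
    nearest = float(exact)            # correctly rounded machine number
    if Fraction(nearest) == exact:    # representable: degenerate interval is valid
        return (nearest, nearest)
    if Fraction(nearest) < exact:     # rounded down: widen the upper bound
        return (nearest, math.nextafter(nearest, math.inf))
    return (math.nextafter(nearest, -math.inf), nearest)
```

With IEEE double precision, `enclose_decimal("0.1")` returns two adjacent machine numbers that straddle one tenth, whereas the degenerate interval formed from the rounded constant alone would not contain it.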

Reading INTERVAL variables: Two options are available in this package:
a free format read and a formatted read.

The free format read will obtain the next data field from the input
stream on the specified unit, convert it, and store the result in the
specified INTERVAL variable. The calling sequence is

      CALL INTRDF(UNIT, X)

The basic package will recognize units 5 (standard input) and 0 (reread),
but the user may add other units or change unit designations as desired;
this is discussed in the technical documentation. A data field may be
any legal representation of an interval variable (see Appendix 1); however,
for simplicity, one may be assured that the format (number, number), where
number is any legal FORTRAN string representing an integer, fixed point
number, or floating point number, is always valid. Embedded blanks between
matching parentheses are always ignored. Fields may be separated by
blanks (as many as desired), although if intervals are enclosed in
parentheses as indicated above, blanks are unnecessary. Fields may be continued
across card boundaries. The input stream remains uninterrupted so long as
all reading is done by INTRDF and the unit number does not change. Once the
input stream has been interrupted, INTRDF begins a new input stream with
a new record.
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The  formatted  read,  as  its  name  implies,  reads  interval  data  according 
to  c specified  format.  This  routine  reads  a vector  of  values  (which  may 
be  of  length  1).  The  calling  sequence  is 
CALL  INTRD(UNIT,  FMT,  A,  N) 

UNIT is as in the free format read; A is the first location of the vector
into  which  the  data  is  to  be  read;  and  N is  the  length  of  the  vector. 

FMT is an array of length 3; FMT(1) is the number of data items per record,
FMT(2) is the number of characters to be ignored before each data field, and
FMT(3) is the width of each data field. Note that these values are constant
for each call to INTRD. A data field may be any legal representation
of  an  interval  variable;  parentheses  are  optional,  and  embedded  blanks  are 
permitted.  No  other  information  is  permitted  within  a data  field. 

Computing  with  INTERVAL  variables:  Expressions  involving  INTERVAL 
variables  are  written  in  standard  FORTRAN  syntax,  just  as  though  INTERVAL 
were a standard FORTRAN data type. A list of the operations and functions
available  in  this  package  may  be  found  in  Appendix  2. 

Mixed mode expressions are permitted, but their use is discouraged due
to the high probability of introducing hidden error. For example, the expression

Y = 0.1 * X

where X and Y are INTERVAL variables, will not yield a correct value of Y;
0.1 will first be converted to REAL by the compiler, and AUGMENT will then
cause that REAL number to be converted to a degenerate interval not containing
.1. Multiplication will then occur using this erroneous interval.
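The hazard is easy to reproduce in any language with binary floating point. The following Python sketch is illustrative only and is not part of the INTERVAL package: the helper names `degenerate_interval` and `contains` are hypothetical, IEEE double precision stands in for the UNIVAC 1110 REAL format, and `math.nextafter` (Python 3.9 or later) stands in for directed rounding. It checks, in exact rational arithmetic, whether the true value 1/10 lies in the degenerate interval built from the rounded literal, and in an interval whose endpoints are moved outward by one unit in the last place.

```python
from fractions import Fraction
import math

def degenerate_interval(x):
    """Return the degenerate interval [x, x] for a float x (hypothetical helper)."""
    return (x, x)

def contains(interval, value):
    """Exact containment test: endpoints and value are compared as rationals."""
    lo, hi = interval
    return Fraction(lo) <= value <= Fraction(hi)

tenth = Fraction(1, 10)           # the exact decimal value 0.1
bad = degenerate_interval(0.1)    # what the mixed mode expression effectively uses

# Outward rounding: move one floating point step down and one step up.
good = (math.nextafter(0.1, -math.inf), math.nextafter(0.1, math.inf))

print(contains(bad, tenth))       # the rounded literal misses 1/10
print(contains(good, tenth))      # the outward-rounded interval encloses it
```

Entering the constant as an interval, so that its endpoints are rounded outward, avoids the problem that the mixed mode expression introduces.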

Other  operators  and  functions  peculiar  to  interval  arithmetic  are 
implemented;  examples  include  the  intersection  of  two  intervals,  the  union 
of  two  intervals,  derivation  of  the  midpoint  and  half-length,  etc.  These 
are  listed  in  Appendix  2.  Relational  operators  are  also  implemented,  but 
they take on different meanings in the context of interval arithmetic;
see  Appendix  2 for  details. 

Writing  INTERVAL  variables:  The  write  routine  will  convert  a vector 
(possibly  of  length  1)  of  INTERVAL  variables  to  external  format  and  write 
it  on  the  specified  output  unit  according  to  the  given  format.  The 
external representation of each interval is guaranteed to contain the
interval,  and  is  the  smallest  interval  representable  in  the  given  format 
which  does  so.  The  calling  sequence  is 
CALL  INTWR(UNIT,  FMT,  A,  N) 

The  basic  package  will  recognize  units  6 (standard  printer)  and  1 (standard 
punch),  but  again  the  user  may  change  designations  and/or  add  units  at  will. 
If  an  illegal  output  unit  is  specified,  INTWR  will  use  the  standard  printer 
instead. 

FMT is now an integer array of length 4. The first three values are
the same as for INTRD (except that ignored characters in the output record
are filled with blanks); FMT(4) is a carriage control character for use
where appropriate. This character must be either '0' or ' ', denoting
double spacing or single spacing, respectively. The width of each data
field specified by the format must be at least great enough to permit the
package to convert one significant digit; in the 1110 version, this is 16
characters, assuming a 2-digit exponent. Add 2 characters for each additional
exponent digit in the external format. If an illegal format is specified,
the routine will default to a standard format.

A and N are as in the formatted read.

Errors: The package is designed to detect all errors as they occur.
The user may elect any of the available responses for any possible error
(see Appendix 3); however, the default response is to print an error message
and halt the computation except in those cases where viable alternatives
exist. Those cases are few indeed; they comprise arithmetic underflows
(where the offending value is set either to zero or to the properly
signed number of smallest magnitude, as appropriate) and errors occurring
on output (where the write routine uses standard modes of output rather
than electing to scrub the computation and lose the output altogether).
In the former case, the computation proceeds without notice to the user;
in the latter case, a message is printed after the output is complete.

The  method  of  changing  the  default  responses  to  errors  is  discussed  in  the 
technical  documentation. 

Producing  an  object  program:  Unless  a sophisticated  job  control 
language  allows  for  an  automatic  (from  the  user’s  point  of  view)  invocation 
of  the  AUGMENT  precompiler  and  the  FORTRAN  compiler,  the  generation  of 
an  object  program  is  a two-step  procedure: 

1.  Use  AUGMENT  to  translate  the  source  program  into  a FORTRAN  program 
compatible  with  the  compiler.  This  can  be  accomplished  with  a run  stream 
of  the  following  type: 

invoke AUGMENT

description decks for BPA and INTERVAL (supplied with the package)

*BEGIN

source program

*END

AUGMENT  will  write  the  translated  program  on  Unit  20. 

2.  Compile  the  output  of  AUGMENT  using  the  standard  FORTRAN  compiler 
and execute the resulting program in the usual manner. The user must ensure
that the BLOCK DATA modules are included when the program is processed
by the linkage editor.




Adaptation  of  the  package  to  other  hardware  is  not  difficult  provided  one 
has  access  to  the  AUGMENT  precompiler.  The  necessary  steps  are: 

1.  Decide  on  data  representations  for  the  interval  endpoints  and 
for  EXTENDED  precision  numbers. 

2.  Code  or  revise  primitives,  as  necessary. 

3. Process the package through the AUGMENT precompiler and compile
the resulting FORTRAN code.

4.  Check  the  package. 

5.  Tune  and  recheck  the  package. 

We discuss each of these steps in greater detail.

Data  representations:  Normally,  the  representation  for  interval  endpoints 
will  be  the  same  as  REAL  and  EXTENDED  will  be  the  same  as  double  precision. 
These  choices  will  simplify  the  adaptation  of  the  package;  however,  for  special 
purposes  such  as  higher  precision  interval  arithmetic,  other  choices  may  be 
made. There are several implicit assumptions which will, to a certain extent,
govern the choices of representations:

a.  The  portion  of  the  package  which  performs  endpoint  evaluations 
(known  as  type  BPA)  will  contain  explicit  routines  to  perform  all 
operations. As designed, it is assumed that conversion from BPA to REAL
is exact, although conversion in the other direction need not be. This
is  done  to  facilitate  adaptation  to  two's  complement  hardware,  where  the 
negative  of  a real  number  is  not  necessarily  representable;  we  assume  that 
the  negative  of  every  BPA  number  is  representable. 

b.  It  is  assumed  that  EXTENDED  is  bound  to  a higher  precision  than 
is  BPA.  Moreover,  we  assume  that  every  BPA  number  and  every  FORTRAN 
integer  can  be  represented  exactly  in  EXTENDED  format.  For  the  evaluation 
of  special  functions,  we  assume  that  a complete  supporting  package  exists 
for type EXTENDED, and that bounds on the accuracy of these routines
are  available. 

Primitives: There are nineteen primitives which depend on the representation
of BPA and EXTENDED numbers in the host system. Two of these are
BLOCK DATA modules, which contain various representation dependent constants;
eight are written in FORTRAN and depend only on BPA format being the same
as REAL and EXTENDED being the same as DOUBLE PRECISION; three depend on both
data representations and the (nonstandard) FLD function; one contains FORMAT
statements which may be representation dependent; and five are arithmetic
primitives which must necessarily be recoded for any change in data representation.
The arithmetic primitives are, in fact, written in assembly language.

In addition, INTRD and INTRDF, while not technically primitives, contain
nonstandard  READ  statements  which  recognize  the  END  OF  FILE  condition.  If  the 
host  compiler  does  not  recognize  this  form  of  READ  statement,  those  statements 
will  need  to  be  modified. 

Complete  documentation  of  these  primitives  is  given  in  the  technical 
manual.  It  does  not  seem  appropriate  to  go  into  greater  detail  here. 

AUGMENT processing: The use of the AUGMENT precompiler preserves both
naturalness of expression and flexibility. Most of the INTERVAL package is
written  in  terms  of  the  nonstandard  types  BPA,  EXTENDED,  and  INTERVAL.  The 
binding  to  specific  data  representations  is  accomplished  through  the  primitives, 
and  these  bindings  are  extended  through  the  remainder  of  the  package  by  the 
use  of  AUGMENT.  Every  effort  has  been  made  to  write  the  package  so  that  the 
output  of  the  AUGMENT  precompiler  will  be  ANSI  Standard  FORTRAN.  There  is 
no requirement that AUGMENT be available on the target computer; the preprocessing
can just as well be done on any computer, with the resulting
FORTRAN code being brought to the target system for compilation.

Checking the package: A collection of test programs is provided with
the  INTERVAL  package.  Successful  execution  of  these  programs  is  reasonably 
good  assurance  that  the  primitives  have  been  implemented  properly. 

Tuning  the  package:  The  price  paid  for  the  degree  of  flexibility 
present  in  the  source  code  for  this  package  is  quite  likely  to  be  decreased 
efficiency in the object code. For example, since the format of BPA
numbers is arbitrary, conversion from REAL to BPA will generate a call on a
subprogram  which  is  responsible  for  performing  this  task  (this  subprogram 
is,  of  course,  a primitive).  If  BPA  numbers  are  the  same  as  REAL,  this  will 
result  in  unnecessary  overhead;  an  in-line  replacement  operation  would 
perform the same task at considerably less cost. AUGMENT cannot be
instructed  to  make  this  modification;  thus,  for  greatest  efficiency, 
it  will  be  necessary  to  examine  the  output  of  AUGMENT  and  replace  calls 
of  this  type  by  in-line  replacement  statements.  There  are,  of  course, 
many  other  possibilities,  depending  on  representation;  for  example,  if  the 
hardware  has  double  precision  capability,  one  could  change  calls  on  the 
interval  replacement  subroutine  to  in-line  replacement  statements  using  the 
double  precision  hardware. 

A certain  amount  of  care  must  be  exercised  in  tuning  the  package. 

For  example,  the  routines  which  evaluate  BPA  relational  operators  call  on 
the  BPA  subtract  routine.  This  should  not  be  altered  unless  the  hardware 
subtract  always  produces  a result  of  the  same  sign  as  the  true  result,  even 
in  cases  of  underflow  and  overflow.  If  the  hardware  sets  an  underflow  to 
zero,  or  gives  garbage  when  overflow  occurs,  then  the  hardware  subtract  must 
not  be  used. 

Needless to say, the package must be rechecked whenever any changes are
made.

5. Conclusion:


In  this  paper,  we  have  sketched  the  design  and  use  of  a package  for 
performing  calculations  in  interval  arithmetic.  The  package  is  both 
flexible  and  transportable;  adaptation  of  the  package  to  other  systems 
can  be  accomplished  by  rewriting  a maximum  of  nineteen  primitive  modules, 
most  of  which  are  easily  adapted  to  a new  host  system.  Further  details  of 
the  package  are  provided  in  the  technical  documentation. 
APPENDIX 1


STANDARD FORTRAN NUMBER AND
INTERVAL NUMBER REPRESENTATIONS

SIGN ::= + | -

INTEGER ::= NULL | <SIGN> | <INTEGER><DIGIT>

RADIX ::= .

FIXEDPOINT ::= <INTEGER><RADIX> | <FIXEDPOINT><DIGIT>

EXPSEP ::= E | D

EXPONENT ::= <SIGN> | <EXPSEP> | <EXPSEP><SIGN> | <EXPONENT><DIGIT>

NUMBER ::= <INTEGER> | <FIXEDPOINT> | <INTEGER><EXPONENT> |
           <FIXEDPOINT><EXPONENT>

ENDPTSEP 

t | a 
COMMA ::= ,

INTERVAL ::= <NUMBER> | (<NUMBER>) | <NUMBER><ENDPTSEP><NUMBER> |
             (<NUMBER><ENDPTSEP><NUMBER>) | (<NUMBER><COMMA><NUMBER>)


INTERVAL  INPUT  RULES 


FORMATTED INPUT:

One and only one <INTERVAL> shall appear in any one field.

Embedded blanks are permitted; they will be ignored.

FREE FORMAT INPUT:

Leading  blanks  are  always  ignored. 

Blanks  within  matching  pairs  of  parentheses  are  always  ignored. 

Commas  within  matching  pairs  of  parentheses  are  regarded  as  endpoint 
separators. 

A field  consists  of  exactly  one  <INTERVAL>. 

A field  is  terminated  by 

1.  A visible  blank; 

2.  Any  of  the  characters  1 # 1 , '«•*/ 

3.  A comma  occurring  outside  of  a matching  pair  of  parentheses; 

4.  Any  nonblank  character  following  a matching  right  paren- 

thesis (If  such  character  is  not  ' *#',  1 or  ' 

it  will  be  regarded  as  the  first  character  of  the  next 
field! ; 

5. A left parenthesis or colon occurring outside of a matching
pair of parentheses. (Such character will be regarded as
the first character of the next field.)

If  a left  parenthesis  is  encountered,  the  scan  proceeds  to  the  matching 
right  parenthesis  regardless  of  what  characters  are  encountered, 
except  that  *$',  and  *■'  always  terminate  the  field. 


ALL INPUT:

A null  field  is  taken  to  represent  the  interval  (0,  0). 

A field containing <NUMBER> or (<NUMBER>) is taken to represent a
degenerate interval; this number is converted and rounded down
for the left endpoint, and up for the right endpoint.

If a field contains two <NUMBER>s, the first will be converted and
rounded down for the left endpoint, and the second will be converted
and rounded up for the right endpoint.
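The conversion rules under ALL INPUT can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions, not the package's input routine: the function names are hypothetical, only the comma is handled as an endpoint separator, field termination rules are ignored, and IEEE double precision with `math.nextafter` (Python 3.9 or later) stands in for the package's BPA arithmetic. Each endpoint is compared with the exact decimal value so that the left endpoint is never above it and the right endpoint never below it.

```python
from fractions import Fraction
import math

def round_down(text):
    """Largest float not exceeding the decimal number written in text."""
    exact = Fraction(text)
    x = float(exact)                      # correctly rounded to nearest
    return x if Fraction(x) <= exact else math.nextafter(x, -math.inf)

def round_up(text):
    """Smallest float not less than the decimal number written in text."""
    exact = Fraction(text)
    x = float(exact)
    return x if Fraction(x) >= exact else math.nextafter(x, math.inf)

def read_interval(field):
    """Convert a field such as '(0.1, 0.3)' or '0.1' to a pair of floats."""
    body = field.replace(" ", "").strip("()")   # blanks are ignored
    if body == "":
        return (0.0, 0.0)                 # a null field represents (0, 0)
    parts = body.split(",")
    left, right = parts[0], parts[-1]     # one number: both endpoints from it
    return (round_down(left), round_up(right))

print(read_interval("(0.1, 0.3)"))
print(read_interval("0.1"))
```

Note that a single number such as 0.1, which is not exactly representable, yields an interval of nonzero width that encloses the decimal value, while a representable number such as 0.5 yields a truly degenerate interval.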
OPERATION | DEFINITION/EXPLANATION | RESULT TYPE | ROUTINE INVOCATION VIA AUGMENT | ROUTINE INVOCATION DIRECT | ROUTINE TYPE

VARIABLE NAMES: The first letter indicates the type of the variable. The second letter is R for
RESULT, A or B for argument; other letters may be used for special meanings.
FAULT                                                               ACTION
NUMBER   MEANING                                                    CODE

  0      No fault                                                   0
  1      Left endpoint - no fault,  Right endpoint - overflow       4*
  2      Left endpoint - no fault,  Right endpoint - infinity       3
  3      Left endpoint - no fault,  Right endpoint - underflow      0
  4      Left endpoint - overflow,  Right endpoint - no fault       4*
  5      Left endpoint - overflow,  Right endpoint - overflow       4*
  6      Left endpoint - overflow,  Right endpoint - infinity       3
  7      Left endpoint - overflow,  Right endpoint - underflow      4*
  8      Left endpoint - infinity,  Right endpoint - no fault       3
  9      Left endpoint - infinity,  Right endpoint - overflow       3
 10      Left endpoint - infinity,  Right endpoint - infinity       3
 11      Left endpoint - infinity,  Right endpoint - underflow      3
 12      Left endpoint - underflow, Right endpoint - no fault       0
 13      Left endpoint - underflow, Right endpoint - overflow       4*
 14      Left endpoint - underflow, Right endpoint - infinity       3
 15      Left endpoint - underflow, Right endpoint - underflow      0
 16      Division by zero                                           3
 17      Zero raised to the zero power                              1
 18      Square root of a negative number                           3
 19      Logarithm of a nonpositive number                          3
 20      Underflow during computation of a BPA result               0
 21      Overflow during computation of a BPA result                3
 22      Intersection of disjoint intervals                         3
 23      Arc cosine or arc sine argument out of range               3
 24      Inverted interval                                          4
 25      Illegal input character                                    4
 26      Illegal input format specification                         4
 27      Illegal output format specification                        1
 28      Input string too long                                      4
 29      Illegal or unspecified input unit                          4
 30      End of file on input unit                                  1
 31      Illegal or unspecified output unit                         1
 32      Conversion array overflow during base conversion           4†
 33      Unrecognized error                                         4

* Denotes that the fault is logically impossible.

† This action should not be changed, since any other action could result
in a recursive call on INTRAP from INTCXH.

In the event that a fault occurs, the corresponding action code governs
the response of the INTRAP routine. The action codes, and their responses,
are:

0  Return to the calling program without taking any action

1  Print error message and return to the calling program

2  Print error message, trace call sequence, and return

3  Print error message, trace call sequence, step error
   counter in Executive program, and return

4  Print error message, trace call sequence, and halt
   computation
AUTOMATIC DIFFERENTIATION OF COMPUTER PROGRAMS


Gershon Kedem

University  of  Wisconsin  - Madison 
Mathematics  Research  Center 


ABSTRACT 

A method  for  the  automatic  differentiation  of  computer  functions 
(subroutines)  written  in  a high  level  language  is  discussed. 


A theory is developed to show that most functions that arise in
applications  can  be  differentiated  automatically.  It  is  shown  how 
one  can  take  a FORTRAN  function  (subroutine)  and,  with  the  aid  of  a 
precompiler,  obtain  a FORTRAN  subroutine  that  computes  the  original 
function  and  its  desired  derivatives. 

Implementation  of  two  types  of  differentiation  is  described: 

1)  Automatic  Taylor  series  expansion  of  FORTRAN  programs. 

2)  Automatic  Gradient  calculation  of  FORTRAN  functions. 

1. Introduction:

The use of recurrence relations to compute derivatives is not a new
idea. We can trace this idea back as far as 1932, when it was used by J. R. Airy
to compute the Emden functions by means of Taylor series expansion
(see [1]). This idea has apparently been rediscovered many times. In 1964
R. E. Moore [7] showed how one could automatically get Taylor series expansions
of FORTRAN-like expressions to solve initial value problems; see [1,2,
5,7,8,10]. The automatic computation of partial derivatives was implemented
in 1967 by A. Reiter and J. Gray [11,14] and later by J. Wertz [12], D. Kuba and
L. B. Rall [13], and R. E. Pugh [9]. These are the programs known to us, but the
list is probably not complete.

This  paper  suggests  a way  to  extend  the  process  to  functions  that  can 
be  written  in  an  algebraic  computer  language  (FORTRAN,  ALGOL  and  so  on) 
namely:  piecewise  factorable  functions. 

Let  us  look  at  computer  programs  for  the  evaluation  of  numerical  functions 
that  arise  in  applications. 

We  shall  use  the  FORTRAN  language  but  the  following  discussion  applies 
to  any  other  high  level  algebraic  language. 

We look at FORTRAN subroutines and functions that evaluate mathematical
functions. We assume no I/O is involved and that "random numbers" are not
used. All such procedures have a few features in common:

1) For every set of values of the formal arguments there is a fixed finite
sequence of instructions executed (provided that these values are within
the domain of definition of the function and we use a correct procedure).

2)  If  we  regard  DO  loops,  logical  statements  and  GOTO  statements  only 
as  convenient  tools  for  defining  sequences  of  instructions,  we  see  that 
all  such  sequences  consist  of  arithmetic  operations  and  calls  to  library 
functions. 

3)  At  each  step  one  only  uses  previously  defined  values. 

4)  Most  of  the  functions  computed  are  piecewise  differentiable  or  actually 
piecewise  analytic. 

We now define a mathematical model for such functions. We will use subscripts
to denote sequences and superscripts to denote components of vectors.

2. Definitions:

i) Let $\mathcal{L}$ be a finite set of real functions of one or more real arguments, including
the identity function. We call $\mathcal{L}$ the set of basic library functions.
Denote by $\pi_i^n: R^n \to R$ the projection onto the ith coordinate, that is

$$\pi_i^n(X^1, \dots, X^n) = X^i .$$

ii) A function $f: D \subset R^n \to R$ is a factorable function if there exists a finite
sequence of functions $f_1, \dots, f_k$ that satisfy the following:

a) For all $1 \le j \le k$, $f_j: D \to R$.

b) $f = f_k$, the last term in the sequence.

c) $f_1 = \pi_1^n$, $f_2 = \pi_2^n$, ..., $f_n = \pi_n^n$.

d) For $n < j \le k$, $f_j$ is either a composition of a basic library
function with one or more functions that appeared earlier in the
sequence, or $f_j$ is identically constant.

iii) We say that the sequence $f_1, \dots, f_k$ represents $f$.

iv) We call the sequence $f_1, \dots, f_k$ a Basic Representation of $f$.
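Example 1:

$\mathcal{L} = \{+, *, /, \mathrm{Sin}, \mathrm{Cos}, \mathrm{Exp}, \dots\}$

$f(x) = \mathrm{Cos}(x) \, \mathrm{Exp}(\mathrm{Sin}(x)) + \mathrm{Sin}^2(x)$.

A Basic Representation of $f$:

$f_1 = \pi_1^1$
$f_2 = \mathrm{Cos}(f_1)$
$f_3 = \mathrm{Sin}(f_1)$
$f_4 = \mathrm{Exp}(f_3)$
$f_5 = f_3 * f_3$
$f_6 = f_2 * f_4$
$f_7 = f_5 + f_6$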
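A basic representation is, in effect, a straight-line program, and it can be evaluated term by term. The Python sketch below is an illustration under stated assumptions rather than anything taken from the paper: the table layout and the names `LIBRARY`, `REPRESENTATION`, `evaluate`, and `direct` are hypothetical. It stores the sequence of Example 1 as data, evaluates each term from earlier terms only, and compares the last term with the closed formula.

```python
import math

# Basic library functions (an illustrative subset of the set in Example 1).
LIBRARY = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
    "Sin": math.sin,
    "Cos": math.cos,
    "Exp": math.exp,
}

# Each entry is (library function name, indices of earlier terms it uses).
# Term 1 is the projection pi_1 and is filled in from the argument x.
REPRESENTATION = [
    ("Cos", (1,)),    # f2 = Cos(f1)
    ("Sin", (1,)),    # f3 = Sin(f1)
    ("Exp", (3,)),    # f4 = Exp(f3)
    ("*", (3, 3)),    # f5 = f3 * f3
    ("*", (2, 4)),    # f6 = f2 * f4
    ("+", (5, 6)),    # f7 = f5 + f6
]

def evaluate(x):
    """Evaluate the sequence f1..f7 at x and return the last term."""
    values = {1: x}                        # f1 = pi_1(x) = x
    for j, (name, args) in enumerate(REPRESENTATION, start=2):
        values[j] = LIBRARY[name](*(values[i] for i in args))
    return values[len(REPRESENTATION) + 1]

def direct(x):
    """The closed formula of Example 1, for comparison."""
    return math.cos(x) * math.exp(math.sin(x)) + math.sin(x) ** 2

print(evaluate(0.7), direct(0.7))
```

Because each term refers only to earlier indices, the same table could be traversed by a different interpreter, for instance one that carries derivative information along with each value.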


Definition:

$f: D \subset R^n \to R^s$ is factorable if $f^j$, $1 \le j \le s$, are factorable.

Remarks : 

1)  Only  if  all  compositions  are  well  defined  can  we  call  f factorable. 

2)  A basic  representation  of  a function  is  not  unique. 

As  an  immediate  consequence  of  the  definitions  we  have 
Proposition  1 : 

If all the basic library functions in the representation of $f$ are continuous,
$C^k$, analytic, ..., so is $f$.

Proposition  2: 

The  composition  of  two  factorable  functions  is  a factorable  function. 
The following lemma is somewhat trivial, but it is included in order to
introduce some notation.

Lemma:

Assume that the sequence of functions $f_1, f_2, \dots, f_k$ satisfies the following:

a) $f_i: D \subset R^n \to R^{s_i}$, that is: all functions have the same domain.

b) For each $i$, $1 \le i \le k$, $f_i$ is either a factorable function on $D$, or $f_i$
is a composition of a factorable function with one or more functions that
appeared earlier in the sequence, or $f_i$ is identically constant.

Then for every $1 \le j \le k$, $f_j$ is a factorable function.

Definition: We say that the sequence $f_1, \dots, f_k$ represents $f_k$, and we call the
sequence $f_1, \dots, f_k$ a Factorable Representation of $f_k$.

Definition:

A function $f: D \subset R^n \to R^m$ is a piecewise factorable function if there exists
a countable number of sets $U_1, U_2, \dots$ such that $D \subset \bigcup_i U_i$ and $f$ restricted to
each $U_i$ is a factorable function.

Examples:

1) Min(·,·), Max(·,·), Abs(·).

2) Spline functions.

3) $A \mapsto A^{-1}$ ($A$ an $N \times N$ matrix).

Remarks:

In practice most computer programs written in high level languages like
FORTRAN and ALGOL compute (represent) piecewise factorable functions. One
can write representations of such functions simply by following the path of execution.
3. Differentiation of piecewise factorable functions:

We now turn to the question of computing derivatives of piecewise factorable
functions. We start with an example.

Example 2:

We take $\mathcal{L}$ and $f$ to be as in Example 1.

$\mathcal{L} = \{+, *, /, \mathrm{Sin}, \mathrm{Cos}, \mathrm{Exp}, \dots\}$

$f(x) = \mathrm{Cos}(x) \, \mathrm{Exp}(\mathrm{Sin}(x)) + \mathrm{Sin}^2(x)$.

A Basic Representation of $f$ is:

$f_1 = \pi_1^1$
$f_2 = \mathrm{Sin}(f_1)$
$f_3 = \mathrm{Exp}(f_2)$
$f_4 = \mathrm{Cos}(f_1)$
$f_5 = f_2 * f_2$
$f_6 = f_4 * f_3$
$f_7 = f_6 + f_5$

We now look at the following sequence of functions:

$f_1 = \pi_1^2$
$f_2 = \mathrm{Sin}(f_1)$
$f_3 = \mathrm{Exp}(f_2)$
$f_4 = \mathrm{Cos}(f_1)$
$f_5 = f_2 * f_2$
$f_6 = f_4 * f_3$
$f_7 = f_6 + f_5$

$\hat{f}_1 = \pi_2^2$
$\hat{f}_2 = \mathrm{Cos}(f_1) * \hat{f}_1$
$\hat{f}_3 = f_3 * \hat{f}_2$
$\hat{f}_4 = -\mathrm{Sin}(f_1) * \hat{f}_1$
$\hat{f}_5 = \hat{f}_2 * f_2 + f_2 * \hat{f}_2$
$\hat{f}_6 = \hat{f}_4 * f_3 + f_4 * \hat{f}_3$
$\hat{f}_7 = \hat{f}_6 + \hat{f}_5$
Please note:

a) The combined sequence $(f_j, \hat{f}_j)$ is a factorable representation of a function
from $R^2 \to R^2$.

b) The functions $f_j$ are the same as before except now they are regarded as
functions of two arguments.

c) There is a simple correspondence between $f_j$ and $\hat{f}_j$, that is: $\hat{f}_j$ only
depends on what functions are composed in the jth stage.

d) It is not hard to see (or prove by induction) that $f_7(x_0, 1) = f(x_0)$ and
$\hat{f}_7(x_0, 1) = \frac{d}{dx} f(x_0)$, or in general, if $x_0 = x(t_0)$ and $y_0 = x'(t_0)$, then
$f_7(x_0, y_0) = f(x(t_0))$ and $\hat{f}_7(x_0, y_0) = \frac{d}{dt} f(x(t)) \big|_{t = t_0}$.


The above result is not a great surprise. All we have done was to systematically
use the chain rule and the fact that we know the derivatives of the basic
library functions.
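The combined sequence of Example 2 can be run directly. The Python sketch below is illustrative only; the function names `example2`, `f`, and `central_difference` are hypothetical, and the variables `d1` through `d7` play the role of the second (derivative) sequence of Example 2. It evaluates both sequences at $(x_0, y_0)$ and compares the derivative term with a central difference of the closed formula.

```python
import math

def example2(x0, y0=1.0):
    """Evaluate the value term f7 and the derivative term d7 at (x0, y0)."""
    f1 = x0
    d1 = y0
    f2 = math.sin(f1)
    d2 = math.cos(f1) * d1
    f3 = math.exp(f2)
    d3 = f3 * d2
    f4 = math.cos(f1)
    d4 = -math.sin(f1) * d1
    f5 = f2 * f2
    d5 = d2 * f2 + f2 * d2
    f6 = f4 * f3
    d6 = d4 * f3 + f4 * d3
    f7 = f6 + f5
    d7 = d6 + d5
    return f7, d7

def f(x):
    """The closed formula of Example 1."""
    return math.cos(x) * math.exp(math.sin(x)) + math.sin(x) ** 2

def central_difference(x, h=1e-6):
    """Numerical derivative of f, used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2 * h)

value, derivative = example2(0.7)
print(value, f(0.7))
print(derivative, central_difference(0.7))
```

With $y_0 = 1$ the second component is the ordinary derivative; passing another $y_0$ gives the derivative along a curve $x(t)$ with $x'(t_0) = y_0$, as in note d).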


Example 3:

Let $[X]_j$ denote $\frac{1}{j!} \frac{d^j X}{dt^j}$.

If one knows $[X]_j$, $[Y]_j$, $j = 0, 1, \dots$ and
$Z = X \pm Y$, then one can compute $[Z]_j$ by:

$$[Z]_j = [X]_j \pm [Y]_j, \qquad j = 0, 1, \dots .$$

If $Z = X * Y$ then by Leibniz' rule

$$[Z]_j = \sum_{i=0}^{j} [X]_i * [Y]_{j-i}, \qquad j = 0, 1, \dots .$$

If $Z = \mathrm{Exp}(X)$ then

$$[Z]_j = \sum_{k=0}^{j-1} \frac{j-k}{j} * [Z]_k * [X]_{j-k}, \qquad j = 1, 2, \dots .$$

We know that there are recursion relations that enable one to compute
the derivatives of functions that satisfy rational differential equations.
As a matter of fact, the discussions in [1] and [7] show that it is possible to write
such recursion relations for functions satisfying differential equations of the
form $y' = f(t, y)$, where $f$ is a factorable function and $\mathcal{L}$ is a set containing
the arithmetic operations and functions satisfying rational differential equations.

If we make systematic use of the chain rule and the recursion rules for
computing successive derivatives of basic library functions, we could construct,
using a basic representation of a function, a factorable representation of the
original function and its successive derivatives.
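The recursion rules of Example 3 for the product and the exponential translate directly into code. The following Python sketch is an illustration under stated assumptions, not the implementation described in the paper: the function names are hypothetical, and a list `x` holds the normalized coefficients $[X]_0, [X]_1, \dots$ up to a fixed order.

```python
import math

def taylor_mul(x, y):
    """Leibniz rule: [Z]_j = sum over i = 0..j of [X]_i * [Y]_(j-i), Z = X * Y."""
    n = len(x)
    return [sum(x[i] * y[j - i] for i in range(j + 1)) for j in range(n)]

def taylor_exp(x):
    """[Z]_j = sum over k = 0..j-1 of ((j-k)/j) * [Z]_k * [X]_(j-k), Z = Exp(X)."""
    z = [math.exp(x[0])]                 # [Z]_0 = Exp([X]_0)
    for j in range(1, len(x)):
        z.append(sum((j - k) / j * z[k] * x[j - k] for k in range(j)))
    return z

# X(t) = t, expanded about t = 0: Exp(X) has coefficients 1/j!.
print(taylor_exp([0.0, 1.0, 0.0, 0.0, 0.0]))

# (1 + t) * (1 + t) = 1 + 2t + t^2.
print(taylor_mul([1.0, 1.0, 0.0], [1.0, 1.0, 0.0]))
```

Applying such rules term by term along a basic representation yields the Taylor coefficients of the whole function, which is the construction the preceding paragraph describes.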

Example 4:

Let us take $\mathcal{L}$ to be as in Example 3. It is not hard to see how one could
derive recursion relations that will enable one to compute partial derivatives.


The  following  is  motivated  by  the  above  examples: 

Let $\mathcal{L}$ be a set of basic library functions and let $T$ be an operator that
maps functions to functions. Let us assume the following:

a) $T$ is defined for all factorable functions.

b) If $f: D \subset R^n \to R$ then $T[f]: E \subset R^{n \cdot k} \to R^k$, where $E = D \times R^{(k-1) \cdot n}$.

c) $T[(f^1, \dots, f^m)] = (T[f^1], \dots, T[f^m])$.

d) For every $g \in \mathcal{L}$, $G = T[g]$ is a factorable function.

e) For any $g \in \mathcal{L}$, $g: \hat{D} \subset R^s \to R$, and for every $f: D \subset R^n \to \hat{D}$,

$$T[g(f)] = T[g](T[f]) .$$

f) $T[c]$ ($c$ a constant function) is a constant k-vector function,
and there exists a factorable function $C: R \to R^k$ such that $T[c] = C(c)$ for all $c \in R$.
Theorem 1:

Let $\mathcal{L}$ and $T$ satisfy the above conditions. Let $f$ be a factorable function
$f: D \subset R^n \to R$ and let $f_1, \dots, f_L$ be a basic representation of $f$; then $T[f]$ is a
factorable function. Moreover, if we replace the terms $f_1, \dots, f_L$ by $F_1, \dots, F_L$,
where:

i) For $1 \le i \le n$, $F_i = \Pi_i$, where $\Pi_i$ is a k-coordinate projection from
$R^{n \cdot k} \to R^k$, that is:

$$\Pi_i(X) = (X^{k(i-1)+1}, X^{k(i-1)+2}, \dots, X^{i \cdot k})$$

ii) For $n < i \le L$,

$$F_i = \begin{cases} G(F_{j_1}, \dots, F_{j_s}) & \text{if } f_i = g(f_{j_1}, \dots, f_{j_s}) \\ C(c) & \text{if } f_i = c \end{cases}$$

then we get a factorable representation of $T[f]$.


Proof : 


The proof by induction on the length of the sequence representing $f$ is
immediate.


Remark : 

The  replacement  rules  are  simple  and  can  be  carried  out  mechanically. 
Corollary:

Let $f: D \subset R^n \to R^m$ be a piecewise factorable function. Let $U_1, U_2, \dots$
be sets such that $D \subset \bigcup_i U_i$ and $f|_{U_i}$ is a factorable function.
Let $f_{i,1}, f_{i,2}, \dots, f_{i,L_i}$ be a basic representation of $f|_{U_i}$ and let
$F_{i,1}, \dots, F_{i,L_i}$ be the corresponding "Replacement Sequence". Define $F$ by
$F|_{W_i} = F_{i,L_i}$, where $W_i = U_i \times R^{(k-1) \cdot n}$; then $F$ is a piecewise factorable
function and $F|_{W_i} = T[f|_{U_i}]$.

4. Taylor Series expansion:

Let $\mathcal{L}$ be a set that consists of the identity function, the arithmetic
operations, functions that satisfy first order rational differential equations (like
Exp and Log), pairs of functions satisfying second order rational differential
equations (like Cos and Sin), and so on.

Choose $\mathcal{L}$ in such a way that all recursion relations between successive
derivatives can be expressed as factorable functions.

Let $k \ge 1$ be an integer. We define $T_k$ as follows:

Let $f: W \subset R^s \to R$ be of class $C^k$. Let $F = T_k[f]$ be the function satisfying:

a) $F: W \times R^{s \cdot k} \to R^{k+1}$.

b) For every function $X: R \to R^s$, $X \in C^k$,
and every $t_0 \in R$ such that $X(t_0) \in W$,

$$F([X(t_0)]_0, [X(t_0)]_1, \dots, [X(t_0)]_k) = ([f(X(t_0))]_0, \dots, [f(X(t_0))]_k) .$$

It is not hard to see that the above $T_k$ and $\mathcal{L}$ satisfy the assumptions of Theorem 1.

Note that the set $\mathcal{L}$ contains only analytic functions and therefore the
factorable functions are analytic.
5. Gradients:

Let $f_{(j)}$ denote $\partial f / \partial X^j$.

Let $\mathcal{L}$ be as above. Let $k \ge 1$ be an integer.

Define $T_k$ as follows:

Let $f: W \subset R^s \to R$ be of class $C^1$ and define $F = T_k[f]$ to be the function
satisfying:

a) $F: W \times R^{s \cdot k} \to R^{k+1}$.

b) For every function $h \in C^1$, $h: R^k \to R^s$, if $h(X_0) \in W$ then

$$F(h(X_0), h_{(1)}(X_0), \dots, h_{(k)}(X_0)) = (f(h(X_0)), f_{(1)}(h(X_0)), \dots, f_{(k)}(h(X_0))) .$$

Again it is not hard to see that the above $T_k$ and $\mathcal{L}$ satisfy the assumptions
of Theorem 1.
6. Iterative procedure:

Many functions are computed iteratively. The number of iterations
in the computer program must be finite, of course. This number, however,
can change according to the values of the arguments. The usual arrangement
in an iterative procedure is as follows: one prescribes a tolerance $\varepsilon$ and sometimes
an initial guess. The program then proceeds with the iterations until the
change in the function value, or the estimated error, is $< \varepsilon$. Once $\varepsilon$ is
fixed, the number of iterations as a function of the arguments is, in most
cases, piecewise constant. Since iterative procedures can differ considerably,
we cannot say precisely what conditions make functions that
are computed iteratively piecewise factorable. However, by careful
study of the particular problem at hand, one can in many cases show that
the function actually computed is in fact piecewise factorable.

In  such  a case  if  one  computes  derivatives  of  that  function, 
one  actually  computes  the  derivatives  of  a piecewise  factorable 
function.  These  derivatives  might  not  be  a good  approximation 
to  the  derivatives  we  had  in  mind.  However  many  times  one  can  use  the 
following  classical  theorem  (see  [ 15]  ). 

Theorem: If $f_n$ is a sequence of analytic functions in the complex
plane and $f_n \to f$ uniformly on a closed disc $D(x_0, r)$ then $f$ is analytic
in the disc, $f_n^{(k)} \to f^{(k)}$ uniformly and $\forall z \in D(x_0, r)$

7T  |f{k)(z)  - f(k\z)  | <4r  sup  If  (w)  - f(w)  | . 

r w<.D(x0,r) 
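The remarks on iterative procedures can be illustrated by a small sketch. It is written in Python rather than in the FORTRAN of the packages, and the routine name `sqrt_newton`, the tolerance and the starting guess are assumptions made only for this illustration. The Newton iteration for the square root is differentiated step by step, while the stopping test looks at the function value alone; for this analytic iteration the derivative carried along converges together with the value.

```python
import math

def sqrt_newton(a, eps=1e-12, x0=1.0):
    """Newton iteration for sqrt(a) that carries d/da along with each iterate.

    Returns (value, derivative, iterations). The stopping test looks only
    at the change in the value, as a typical iterative routine would.
    """
    x, dx = x0, 0.0                 # the initial guess does not depend on a
    iterations = 0
    while True:
        x_new = 0.5 * (x + a / x)                      # Newton step
        dx_new = 0.5 * (dx + (x - a * dx) / (x * x))   # the same step differentiated in a
        iterations += 1
        done = abs(x_new - x) <= eps
        x, dx = x_new, dx_new
        if done:
            return x, dx, iterations

value, deriv, n_iter = sqrt_newton(2.0)
print(value, deriv, n_iter)
print("exact derivative:", 0.5 / math.sqrt(2.0))
```

For a fixed tolerance the iteration count returned here is an integer-valued, piecewise constant function of the argument, which is the situation described in the text.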

7. Taylor Series Expansions of Implicit Functions:

Let f be a factorable function $f: R \times R^n \to R^n$ and assume that
$f(t_0, x_0) = 0$, $t_0 \in R$, $x_0 \in R^n$, and that $\Gamma = f_x^{-1}(t_0, x_0)$ exists. Then the
system of equations $f(t, x(t)) \equiv 0$ defines implicitly a unique function
$x: R \to R^n$ such that $x(t_0) = x_0$ and $f(t, x(t)) \equiv 0$ in the neighborhood of $(t_0, x_0)$.

Moreover, from Theorem 20.3 in [6, Ch 5, p. 329-332] it follows that if

$g_i(t) = f(t, \sum_{j=0}^{i-1} [x(t_0)]_j \cdot (t - t_0)^j)$

then

i) For $m < i$, $[g_i(t_0)]_m = 0$

ii) $[x(t_0)]_i = -\Gamma \, [g_i(t_0)]_i$.

Clearly $g_i(t)$ is a factorable function and one can compute $[g_i(t_0)]_i$. Therefore
one can compute $[x(t_0)]_i$ by the following bootstrap procedure:

Start by the regular Newton method to find $x_0$ such that $f(t_0, x_0) = 0$,
then set $\Gamma = f_x^{-1}(t_0, x_0)$. At this point it is possible to compute $[x(t_0)]_1$.
Once one has $[x(t_0)]_1$ one can compute $[x(t_0)]_2$, and so on.
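The bootstrap procedure can be sketched for a scalar problem. The following Python fragment is only an illustration under stated assumptions: the test equation $f(t, x) = x^2 - 1 - t$ and the names `taylor_mul`, `f_series` and `implicit_taylor` are hypothetical and are not part of the packages. Coefficients not yet known are held at zero, so evaluating f on the current series gives the series of $g_i$, and its i-th coefficient yields $[x(t_0)]_i$.

```python
def taylor_mul(a, b):
    """Cauchy product of two truncated series of equal length."""
    return [sum(a[k] * b[j - k] for k in range(j + 1)) for j in range(len(a))]

def f_series(t, x):
    """f(t, x) = x*x - 1 - t, evaluated on series of normalized coefficients."""
    xx = taylor_mul(x, x)
    return [xx[j] - (1.0 if j == 0 else 0.0) - t[j] for j in range(len(t))]

def implicit_taylor(t0, x_guess, order):
    """Normalized Taylor coefficients [x(t0)]_0 .. [x(t0)]_order of the
    function x(t) defined implicitly by f(t, x(t)) = 0 (order >= 1)."""
    x0 = x_guess
    for _ in range(50):                       # regular Newton method for x0
        x0 -= (x0 * x0 - 1.0 - t0) / (2.0 * x0)
    gamma = 1.0 / (2.0 * x0)                  # inverse of f_x(t0, x0) = 2*x0
    t = [t0, 1.0] + [0.0] * (order - 1)       # series of the independent variable
    x = [x0] + [0.0] * order                  # unknown coefficients start at zero
    for i in range(1, order + 1):
        g = f_series(t, x)                    # series of g_i; x holds the terms below i
        x[i] = -gamma * g[i]                  # [x]_i = -Gamma * [g_i]_i
    return x

coeffs = implicit_taylor(0.0, 1.0, 4)
print(coeffs)
```

With $t_0 = 0$ and $x_0 = 1$ the implicit function is $\sqrt{1 + t}$, so the result can be compared with the binomial series.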


a) Band gradients: Many times the gradient matrix is in band form (for
example in the numerical solution of a two-point boundary value problem). In
such a case the gradient can be computed with a saving in time and
space.

Let $f: R^n \to R^n$ be a factorable function (piecewise factorable),
and  assume  tl  t,for  all  j,  f*  depends  only  on  {xl|  where  |l  - i < l ) 

Assume  t to  compute  I 

ll  ' 

with  tli*  matrix 


b)  The  general  case:  Many  times  the  gradient  matrix  is  sparse  but 
does  not  have  a structure  that  is  easy  to  take  advantage  of.  However  it  is  not 

of the entries of the matrix are identically
zero, one can carry this information with the computation and use the following
obvious fact:


and  if 


vt 

1,  • Uls-'-o  ) 


Nx>  ffg.fxf, . . . ,g  f x)) 
s 


I { *0}i  t J J 

* 1 ax  ' - t l \ 
t 


In the implementation each variable will be a pointer to a vector
which will hold the list of nonzero partial derivatives. Each subroutine
that replaces a call to a library function will compute function values
and nonzero partial derivatives only. The subroutine will create a list of the
nonzero partial derivatives of the composition.


The implementation of automatic computation of partial derivatives
of FORTRAN functions (GRADIENT) described in this report does not use
the above method. We plan to implement this method in the near future.
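The idea of carrying only the nonzero partial derivatives can be sketched as follows. This is an illustration in Python under our own assumptions, not the planned FORTRAN implementation: the class `SparseGrad` and the function `sparse_sin` are hypothetical names, and a dictionary stands in for the list of nonzero partials.

```python
import math

class SparseGrad:
    """A value together with a dict {argument index: partial derivative}
    that stores only the partials which may be nonzero."""

    def __init__(self, value, partials=None):
        self.value = value
        self.partials = dict(partials or {})

    @classmethod
    def independent(cls, value, index):
        return cls(value, {index: 1.0})

    def _combine(self, other, value, d_self, d_other):
        # Chain rule over the union of the two index lists.
        out = {i: d_self * p for i, p in self.partials.items()}
        for i, p in other.partials.items():
            out[i] = out.get(i, 0.0) + d_other * p
        return SparseGrad(value, out)

    def __add__(self, other):
        return self._combine(other, self.value + other.value, 1.0, 1.0)

    def __mul__(self, other):
        return self._combine(other, self.value * other.value, other.value, self.value)

def sparse_sin(x):
    c = math.cos(x.value)
    return SparseGrad(math.sin(x.value), {i: c * p for i, p in x.partials.items()})

xs = [SparseGrad.independent(v, i) for i, v in enumerate([1.0, 2.0, 3.0, 4.0, 5.0])]
y = xs[0] * xs[1] + sparse_sin(xs[4])
print(sorted(y.partials.items()))
```

Here y depends on three of the five arguments, and only those three partial derivatives are stored and updated.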


9.  The  Implementation. 


9.1.  Introduction. 

In  earlier  sections  it  was  pointed  out  that  most  computer  functions 
and  subroutines  used  in  numerical  computations  compute  (represent) 
piecewise  factorable  functions.  It  was  shown  that  every  sequence 
representing  a piecewise  factorable  function  can  be  transformed  into 
another  sequence  which  represents  the  original  piecewise  factorable 
function  and  its  total  or  partial  derivatives.  The  translation  process  is 
merely  a replacement  process  and  can  be  carried  out  mechanically. 

In  order  to  implement  such  a replacement  process  one  needs  a 
processor  that  will  do  the  following: 

a)  Break  the  subprogram  into  a sequence  of  one  step  terms. 

b)  Replace  each  term  by  a body  of  code. 

c) Expand each program variable into a vector. The size of that
vector will depend on the order of the operator implemented.

d) The processor should leave the control structure unaltered,
that is: DO loops and IF statements should be left unaltered.

There  are  three  principal  ways  to  implement  the  replacement  process. 

a)  Macro  expansion. 

b)  Replacing  each  term  by  a subroutine  call. 

c) Using an interpreter.

We chose to use the second method mainly because we could use an
existing precompiler: AUGMENT. We also feel that the second method is
the most powerful and flexible of the three.

9.2.  The  AUGMENT  Precompiler. 

Since  we  are  using  AUGMENT  as  the  main  tool  for  the  implementation 
of  automatic  differentiation,  we  find  it  appropriate  to  give  a very  short 
description  of  the  function  and  use  of  AUGMENT.  However,  in  order  to 
understand  fully  how  to  use  it,  the  user  should  read  [3]. 

The AUGMENT precompiler was designed to simplify the use of
nonstandard data types in FORTRAN. AUGMENT enables one to define
new data types and operations. It enables one to write FORTRAN programs
using these new data types as though they were standard. AUGMENT
input consists of programs written in "extended" FORTRAN, that is,
FORTRAN programs using nonstandard data types, operators, and functions.
AUGMENT translates the input programs into standard FORTRAN programs
with the nonstandard constructs translated into subroutine and function
calls. The supporting package (that is, the above subroutines and
functions) implements the operations with the nonstandard data types.

In order to implement a new data type with AUGMENT one has to do
two things:

1) write a package of subroutines to implement the operations and
functions defined on the new data type,

2) write a description deck which describes the new data type.

In the description deck one provides AUGMENT with the following
information:

a)  The  name(s)  of  the  new  data  type(s). 

b)  The  number  and  type  of  computer  words  to  reserve  for  each 
nonstandard  variable. 

c) The set of operations and "standard" functions defined on the
new data type.

d)  The  names  and  calling  sequences  of  the  subroutines  and  functions 
that  implement  the  above  operations. 

e)  The  relations  between  the  new  data  types  and  other  data  types 
(standard  and  nonstandard). 

For more detailed information see [3].

9.3.  GRADIENT  and  TAYLOR  Packages. 

This  report  describes  the  implementation  of  two  types  of  differentiation: 

1) TAYLOR: Automatic Taylor series expansion of FORTRAN functions.

2)  GRADIENT:  Automatic  gradient  computation  of  FORTRAN  functions. 

The  automatic  differentiation  is  implemented  by  providing  two  new  data 
types:  TAYLOR  and  GRADIENT.  The  operations  defined  on  the  new  data 
types  are  the  arithmetic  operations  and  almost  all  the  standard  FORTRAN 
functions. Each subroutine that implements a nonstandard operation
computes (represents) the corresponding factorable function G of Theorem 1.

Suppose one wants to compute the gradient of a function f, which
is a function of n variables. One has only to write a FORTRAN function
(subroutine) that computes f. In the function or subroutine code, one
declares the arguments and other variables (including the function itself)
as type GRADIENT. Then one submits this function with the description
deck (which is part of the package) to AUGMENT as data. The output
from AUGMENT will be the desired subroutine. AUGMENT will translate
the function into a FORTRAN subroutine written in ANSI standard FORTRAN,
declaring each GRADIENT variable as a REAL vector of dimension n + 1
(and each k-vector as an (n + 1) × k REAL array, and so on). Each
arithmetic operation or function call will be translated into a call to the
appropriate subroutine. The translated subroutine together with the
subroutines provided by the GRADIENT package will compute the gradient of
the function f at any desired point.

Below we give a detailed description of the two packages and their
use. Most of the details of the two packages are the same: anything that is
said below applies to both packages, unless the contrary is explicitly stated.
We will use the term VARIABLE for either type GRADIENT or TAYLOR and
CONSTANT for types REAL, INTEGER or DOUBLE PRECISION. The relations
between VARIABLE and type COMPLEX are undefined.


TAYLOR and GRADIENT Variables:

Each GRADIENT variable is a REAL vector of dimension N + 1 where
N is the number of the independent arguments. The first word holds the
variable value and the (I + 1)th word holds the partial derivative of
that variable with respect to the Ith independent argument.

Each TAYLOR variable is a REAL vector of dimension N + 1 where
N is the highest normalized derivative to be computed. The (I + 1)th
place holds the Ith normalized derivative, I = 0, 1, ..., N.
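The two storage layouts can be sketched in a few lines. The sketch is in Python, with 0-based lists in place of the 1-based FORTRAN words, and the helper names `gradient_seed`, `gradient_mul` and `taylor_from_derivatives` are assumptions made for the illustration only.

```python
import math

def gradient_seed(value, index, n):
    """GRADIENT layout: entry 0 holds the value, entry i (1 <= i <= n) the
    partial derivative with respect to the i-th independent argument."""
    g = [0.0] * (n + 1)
    g[0] = value
    g[index] = 1.0
    return g

def gradient_mul(a, b):
    """Product rule applied entry by entry to two GRADIENT-style vectors."""
    return [a[0] * b[0]] + [a[i] * b[0] + a[0] * b[i] for i in range(1, len(a))]

def taylor_from_derivatives(derivs):
    """TAYLOR layout: entry I holds the I-th derivative divided by I!."""
    return [d / math.factorial(i) for i, d in enumerate(derivs)]

x = gradient_seed(3.0, 1, 2)        # first of two independent arguments
y = gradient_seed(4.0, 2, 2)        # second of two independent arguments
print(gradient_mul(x, y))           # value of x*y followed by its two partials
print(taylor_from_derivatives([1.0, 1.0, 1.0, 1.0]))   # exp(t) about t = 0
```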

Arithmetic  Operations: 

All  arithmetic  operations  between  VARIABLES  and  all  arithmetic 
operations  between  VARIABLE  and  CONSTANT  are  legal  except  INTEGER 
raised to a VARIABLE power. Since the recursion relations that replace
arithmetic  operations  between  CONSTANT  and  VARIABLE  are  simpler  than  the 
general  relations,  separate  routines  are  provided  to  implement  the 
arithmetic  operations  between  CONSTANTS  and  VARIABLES.  Conversion  of 
CONSTANT  to  a vector  format  is  done  if  there  is  a statement  of  the  form 
V = c where  V is  a VARIABLE  and  c is  a REAL  expression,  or  if  there 
is  a reference  to  a conversion  function  (see  Conversion  routines). 

Standard  Functions: 

In Table 1, we list the standard functions that are implemented in
the two packages. One can easily add other functions to that list by adding
their names to the description deck and writing subroutines to implement
them.

Table 1

Function reference   Function      Suffix   Comments
ABS(x)               |x|           ABS
ACOS(x)†             cos^-1(x)     ACS
ALOG(x)*             ln(x)         LN
ALOG10(x)            log10(x)      LOG
AMAX1(x,y)*          max(x,y)      MAX      function of two arguments only
AMIN1(x,y)*          min(x,y)      MIN      function of two arguments only
ATAN(x)              tan^-1(x)     ATN
ASIN(x)†             sin^-1(x)     ASN
CBRT(x)†             x^(1/3)       CBR
COS(x)               cos(x)        COS
COSH(x)†             cosh(x)       CSH
COTAN(x)†            cotan(x)      CTN
EXP(x)               exp(x)        EXP
LOG(x)†              ln(x)         LN
MAX(x,y)             max(x,y)      MAX      function of two arguments only
MIN(x,y)             min(x,y)      MIN      function of two arguments only
SIN(x)               sin(x)        SIN
SINH(x)†             sinh(x)       SNH
SQRT(x)              x^(1/2)       SQR
TAN(x)†              tan(x)        TAN

† Not an ANSI standard function.
* See automatic typing.

compilers, automatic typing of functions; that is, the type of function
used is determined by its arguments. Thus, for example, we have defined
LOG and ALOG to be names of the function which takes the logarithm
of an argument of type VARIABLE.

Conversion Functions:

There are three subroutines that implement conversion from CONSTANT
to VARIABLE, one each for types REAL, INTEGER and DOUBLE PRECISION.
These routines can be referenced in the original program by the use of
the conversion function: CTTYL(·) in TAYLOR and CTGRD(·) in GRADIENT.
The function accepts all three standard types as arguments (see automatic
typing). Automatic conversion is invoked only for type REAL. That is:
the statement V = const is legal only if the const is of type REAL.

Norm Function:

It is sometimes convenient to test the distance between two VARIABLES
(for example in a test of convergence). Since the relational operators compare
only the first words of the VARIABLES they cannot be used for that purpose.
The packages provide a function NORM that computes the distance of a
VARIABLE from the 0 vector. In the TAYLOR package, the function NORM is a
function of two arguments, TAYLOR and REAL:

$\mathrm{NORM}(v, t) = \max_{0 \le i \le N} |[v]_i| \, t^i$

In GRADIENT, NORM is a function of one argument:

$\mathrm{NORM}(v) = \max_{0 \le i \le N} |v_i|$

In both packages the function is implemented as a REAL function.
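The two norms can be written down directly. The following Python sketch uses the hypothetical names `taylor_norm` and `gradient_norm`; it is an illustration of the formulas above and makes no claim about how the NRM routines are coded.

```python
def taylor_norm(v, t):
    """max over i of |[v]_i| * t**i for a vector of normalized Taylor coefficients."""
    return max(abs(c) * t ** i for i, c in enumerate(v))

def gradient_norm(v):
    """max over i of |v_i| for a GRADIENT-style vector (value and partials)."""
    return max(abs(c) for c in v)

a = [1.0, -2.0, 3.0]
b = [1.0, 5.0, 3.0]
difference = [p - q for p, q in zip(a, b)]

print(a[0] == b[0])                 # a comparison of first words sees no difference
print(gradient_norm(difference))    # the norm of the difference does
print(taylor_norm(a, 0.5))
```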

Logical  Statements : 

The  relational  operators  can  be  used  to  compare  two  VARIABLES,  or 
VARIABLE  with  type  REAL.  The  comparison  is  done  between  the  first  words 
of  the  VARIABLES  or  between  the  first  word  of  the  VARIABLE  and  type  REAL. 

The  comparison  operators  are  implemented  as  LOGICAL  functions. 

Other  Subroutines: 

The  packages  provide  two  additional  subroutines: 

1)  Error  handling  subroutine  (see  our  later  discussion  of  Error  handling). 

2)  Copy  subroutine. 

The copy subroutine implements the statement A = B, A and B VARIABLES.

Subroutine Names:

The  names  of  all  subroutines  in  both  supporting  packages  are 
composed  of  two  parts: 

i)  The  first  three  letters  (the  prefix), 
ii)  The  last  three  or  two  letters  (the  suffix). 

All the routines in each package have a common prefix: TYL in
TAYLOR and GRD in GRADIENT. In order to avoid name conflicts, the
user should avoid using names starting with the above prefixes.

The suffix of a subroutine's name depends on the function or
operation the supporting routine implements. In Table 1, we give the
suffixes of the routines implementing the "standard" functions. The
suffix of a routine implementing arithmetic operations is given systematically.
The first letter describes the operator: A for +, S for -, M for *,
D for /, and E for **. The next two letters describe the operands:
first the left operand and then the right one. The letter R stands for
REAL, I for INTEGER, D for DOUBLE PRECISION and V for VARIABLE. So
MVV will be the suffix of a routine that implements (VARIABLE) * (VARIABLE)
and DDV of a routine that implements (DOUBLE PRECISION)/(VARIABLE).

The suffix of the LOGICAL functions implementing the relational
operators is composed of the two letters representing the operator and the
letter V. So .LT. is implemented by a function with suffix LTV. The
suffix of the routine implementing the norm function is NRM, of the error
routine ERR, and of the copy routine CPY.
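The naming scheme is mechanical and can be expressed as a small lookup. The Python helper functions below (`arithmetic_routine`, `relational_routine`, `function_routine`) are hypothetical and serve only to spell out the rule; the prefixes, operator letters and operand letters are those given in the text.

```python
PREFIX = {"TAYLOR": "TYL", "GRADIENT": "GRD"}
OPERATOR_LETTER = {"+": "A", "-": "S", "*": "M", "/": "D", "**": "E"}
OPERAND_LETTER = {"REAL": "R", "INTEGER": "I", "DOUBLE PRECISION": "D", "VARIABLE": "V"}

def arithmetic_routine(package, operator, left, right):
    """Prefix + operator letter + left operand letter + right operand letter."""
    return (PREFIX[package] + OPERATOR_LETTER[operator]
            + OPERAND_LETTER[left] + OPERAND_LETTER[right])

def relational_routine(package, operator):
    """Prefix + the two letters of the relational operator + V."""
    return PREFIX[package] + operator.strip(".") + "V"

def function_routine(package, suffix):
    """Prefix + a suffix from Table 1, or NRM, ERR, CPY."""
    return PREFIX[package] + suffix

print(arithmetic_routine("TAYLOR", "*", "REAL", "VARIABLE"))
print(arithmetic_routine("GRADIENT", "/", "DOUBLE PRECISION", "VARIABLE"))
print(relational_routine("TAYLOR", ".LT."))
print(function_routine("GRADIENT", "CPY"))
```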

9.4. Using the Package with AUGMENT.

Writing the Source Code:

In order to get derivatives of a function, say the Taylor series
expansion of a function f, the user should write an "extended" FORTRAN
function or subroutine that computes f. All legal FORTRAN constructs can
be used. All program variables which depend on the independent variable
should be declared as type TAYLOR, including the function itself. Program
variables which do not depend on the independent variable can be of any
other type.

External functions and subroutines can be used as part of the
computation. Functions must be declared as type TAYLOR. External
functions and subroutines can be translated separately. However, one
has to make sure that the number of computer words reserved for each
TAYLOR variable in the external subroutine (function) is the same as
the number of words reserved in the calling routine.

The  Description  Deck: 

The  next  step  is  to  submit  the  source  deck  with  the  description  deck 
as  data  to  AUGMENT.  See  Appendix  C for  the  deck  structure.  The 
description  deck  is  supplied  with  the  package. 

However,  since  the  number  of  words  reserved  for  each  VARIABLE 
changes  from  problem  to  problem,  this  number  has  to  be  put  into  the 
description deck each time. To make it easier, the description deck was
split  into  two  parts:  HEAD  and  BODY.  The  number  of  words  to  reserve  is 
inserted  in  between  the  two  parts.  This  number  should  be  an  integer 
constant.  Column  1 in  the  card  holding  this  number  must  be  left  blank. 

Order  of  Differentiation: 

The routines in both packages were designed to implement any "order"
of differentiation without the need to be recompiled every time the "order"
is changed. The "order" of differentiation is provided to the package through
a common block: in TAYLOR by COMMON/DEGREE/N and in GRADIENT by
COMMON/ORDER/N. N is the order of the operator, that is: if N = 1
the package will compute function values only; if N = 2, function values
and first derivatives (or first partial with respect to one variable); and
so on. The routines in the package do not check that there is enough
space provided for the VARIABLES. However there is a check that N ≥ 1.

The order can be changed at run time but care should be taken not to
exceed the number of words provided for each VARIABLE.

Working  Space: 

Some routines in TAYLOR need work space and, since they are
designed to handle any order of differentiation, the work space has to be
provided by the user. The work space is provided through four common blocks:

      COMMON/WORK1/WORK1(N)
      COMMON/WORK2/WORK2(N)
      COMMON/WORK3/WORK3(N)
      COMMON/WORK4/WORK4(N)

N should be the highest order of differentiation used. The GRADIENT
package does not require work space.

Using  the  Translated  Routine: 

The translated routine is a FORTRAN subroutine that gets as input
the value of its arguments and their derivatives, and gives as output the
value of the function and its derivatives. For example, in the TAYLOR
package, if t is the independent variable, then t, 1, 0, 0, ... is the
Taylor series expansion of t. In the GRADIENT package, if $x_i$ is the
independent variable then $\partial x_i / \partial x_i = 1$ and $\partial x_i / \partial x_j = 0$ for $i \ne j$.

Example

In this example we compute the first 9 terms of the Taylor series
expansion of f(t) = exp(cos(4t)) + arctan(sin²(t)) at the point t = .5.

a) The source code:

      TAYLOR FUNCTION FUN(T)
      TAYLOR T
      FUN=EXP(COS(4.*T))+ATAN(SIN(T)**2)
      RETURN
      END


b) The translated code:

      SUBROUTINE FUN (T,TYLRLT)
C     ===== PROCESSED BY AUGMENT, VERSION 41 =====
C     TEMPORARY STORAGE LOCATIONS
C     TAYLOR
      REAL TYLTMP(9,2)
C     LOCAL VARIABLES
C     TAYLOR
      REAL TYLRES(9)
C     GLOBAL VARIABLES
C     TAYLOR
      REAL T(9),TYLRLT(9)
C     ===== TRANSLATED PROGRAM =====
      CALL TYLMRV (4.,T,TYLTMP(1,1))
      CALL TYLCOS (TYLTMP(1,1),TYLTMP(1,1))
      CALL TYLEXP (TYLTMP(1,1),TYLTMP(1,1))
      CALL TYLSIN (T,TYLTMP(1,2))
      CALL TYLEVI (TYLTMP(1,2),2,TYLTMP(1,2))
      CALL TYLATN (TYLTMP(1,2),TYLTMP(1,2))
      CALL TYLAVV (TYLTMP(1,1),TYLTMP(1,2),TYLRES)
      GO TO 30000
C     RETURN CODE
30000 CONTINUE
      CALL TYLCPY (TYLRES,TYLRLT)
      RETURN
      END

c) Main program:

      DIMENSION T(9),FUNC(9)
      COMMON /DEGREE/ N
      COMMON /WORK1/W1(9)
      COMMON /WORK2/W2(9)
      COMMON /WORK3/W3(9)
      COMMON /WORK4/W4(9)
      N=9
      T(1)=.5
      T(2)=1.
      DO 10 I=3,9
   10 T(I)=0.
      CALL FUN(T,FUNC)
      PRINT 91,T(1),(FUNC(L),L=1,9)
   91 FORMAT(1X,'EXAMPLE: TAYLOR SERIES EXPANSION',/,3X,'T=',F6.3//,
     * (3X,E13.5))
      STOP
      END


d) Output:

 EXAMPLE: TAYLOR SERIES EXPANSION
   T=  .500

    .88551+00
   -.15998+01
    .69251+01
   -.77435+01
   -.34296+01
    .35406+02
   -.59899+02
    .28534+02
    .10762+03


In  Appendix  B we  give  a more  complicated  example. 
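The example can be checked independently of AUGMENT. The following Python sketch is not the TAYLOR package: the function names (`t_mul`, `t_div`, `t_exp`, `t_sincos`, `t_atan`, `fun`) are hypothetical, and the routines simply apply the recurrence relations of Appendix A to lists of normalized coefficients. The input series t, 1, 0, 0, ... is seeded in the same way as in the main program.

```python
import math

N = 9   # number of Taylor coefficients, as in the example

def t_mul(x, y):
    return [sum(x[k] * y[j - k] for k in range(j + 1)) for j in range(N)]

def t_div(x, y):
    z = []
    for j in range(N):
        z.append((x[j] - sum(z[k] * y[j - k] for k in range(j))) / y[0])
    return z

def t_exp(x):
    z = [math.exp(x[0])]
    for j in range(1, N):
        z.append(sum((j - k) * z[k] * x[j - k] for k in range(j)) / j)
    return z

def t_sincos(x):
    s, c = [math.sin(x[0])], [math.cos(x[0])]
    for j in range(1, N):
        s_j = sum((j - k) * c[k] * x[j - k] for k in range(j)) / j
        c_j = -sum((j - k) * s[k] * x[j - k] for k in range(j)) / j
        s.append(s_j)
        c.append(c_j)
    return s, c

def t_atan(x):
    one = [1.0] + [0.0] * (N - 1)
    v = [a + b for a, b in zip(t_mul(x, x), one)]    # V = X**2 + 1
    w = t_div(one, v)                                # W = 1/V
    z = [math.atan(x[0])]
    for j in range(1, N):
        z.append(sum((j - k) * w[k] * x[j - k] for k in range(j)) / j)
    return z

def fun(t):
    """exp(cos(4t)) + arctan(sin(t)**2) on a series of normalized coefficients."""
    four_t = [4.0 * c for c in t]
    _, cos4t = t_sincos(four_t)
    sin_t, _ = t_sincos(t)
    return [a + b for a, b in zip(t_exp(cos4t), t_atan(t_mul(sin_t, sin_t)))]

t = [0.5, 1.0] + [0.0] * (N - 2)    # the series of the independent variable
coeffs = fun(t)
for c in coeffs:
    print(f"{c:13.5e}")
```

The printed coefficients can be compared line by line with the output listed under d).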


Error Handling:

In general the packages do not check that the arguments are within
the domain of definition of the functions. Checking is done only for division
by REAL, INTEGER or VARIABLE types. The packages also provide the
capability to specify what constitutes division by zero. If the absolute value
of an argument to a division routine is smaller than or equal to a specified
value, an error occurs. This value is set initially to 0.0 by a DATA
statement. However it can be changed at run time through common block
/ZERO/EPS. The routines in the package also check that the order of
differentiation is at least 1. In case an error is discovered, the error
routine is called (suffix ERR). The error routine prints error messages
and performs a "walk back" (which traces the sequence of subprogram
calls back to the main program) and stops.
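The division test can be sketched as follows. The Python function `gradient_divide` and the module variable `EPS` are hypothetical stand-ins for a division routine and for the value kept in common block /ZERO/; raising an exception replaces the call to the error routine.

```python
EPS = 0.0   # stands in for the value held in common block /ZERO/

def gradient_divide(x, y, eps=None):
    """Quotient rule on GRADIENT-style vectors [value, partial_1, ..., partial_N].

    Raises ZeroDivisionError when |y[0]| <= eps, mirroring the documented test.
    """
    threshold = EPS if eps is None else eps
    if abs(y[0]) <= threshold:
        raise ZeroDivisionError(f"|divisor| = {abs(y[0])} <= {threshold}")
    q = x[0] / y[0]
    return [q] + [(dx - q * dy) / y[0] for dx, dy in zip(x[1:], y[1:])]

x = [6.0, 1.0, 0.0]
y = [3.0, 0.0, 1.0]
print(gradient_divide(x, y))
```

Passing a larger `eps` treats small but nonzero divisors as zero, which corresponds to changing the threshold at run time.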

Non  ANSI  FORTRAN  Parts: 

No  special  care  was  taken  to  comply  with  some  of  the  restrictions 
imposed  by  ANSI  FORTRAN,  for  two  reasons: 

a)  Some  of  the  features  in  the  packages  could  not  have  been 
implemented  otherwise. 

b) Some of the restrictions violated seem to us arbitrary and
unreasonable. They do not exist in most production compilers.

Below we give the list of non ANSI FORTRAN constructions used.

1) The set of "standard" functions used is larger than the set in
ANSI FORTRAN. Functions which are not in the Standard are flagged
in the function table by †. The corresponding routines could be
deleted or modified to fit different systems.

2) The work space to TAYLOR is provided through common blocks
and therefore the common block sizes in the TAYLOR routines
would be different from the block sizes in the main program.

3) The walk back routine used in the error handling routine is
nonstandard and has to be changed in other systems.

4) The REAL variable EPS appears in common block /ZERO/ and in a
DATA statement (in the error routine).

5) No care was taken to comply with the standard concerning
expressions as subscripts to arrays.

6) The function ATAN2 is not implemented.

7) No care was taken not to mix INTEGER with REAL.

Using the Package; Summary:

Assume one has all the package subroutines in relocatable form, so
they can be used as library routines. In order to get the derivatives of a
function f one has to do the following:

A) Write a FORTRAN subroutine† (function) that computes the function
f. In the subroutine, declare all FORTRAN variables that depend on the
independent variables as type TAYLOR (or GRADIENT). The rest of the
variables can be of any other type.

B) Insert the number of computer words reserved for each VARIABLE
into the description deck.

C) Submit the source deck with the appropriate description deck as
data to AUGMENT (see Appendix C and also [3]).

D) Submit the output from AUGMENT as data to the FORTRAN compiler.

E) Call the subroutine with the desired arguments.

† One can use main programs too but that is a more complicated way of doing it.

Remarks


1)  Always  remember  that  the  initial  values  of  the  derivatives  of  the 
input  variables  have  to  be  provided  too. 

2)  All  the  restrictions  that  apply  to  the  use  of  AUGMENT  apply  to  the 
packages . 

3)  AUGMENT  does  not  translate  I/O  and  DATA  statements.  Their 
translation  has  to  be  done  by  hand. 

4) This report is by no means a substitute for the AUGMENT user
information manual (MRC TSR #1469 [3]). The user should be familiar with that
report.

Appendix A


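Recurrence Relations for Taylor Series Expansion:

a) $Z = X \pm Y$

$[Z]_J = [X]_J \pm [Y]_J$, $J = 0, 1, \ldots$

b) $Z = X \cdot Y$

$[Z]_J = \sum_{k=0}^{J} [X]_k \cdot [Y]_{J-k}$, $J = 0, 1, \ldots$

c) $Z = X / Y$

$[Z]_J = \frac{1}{[Y]_0} \{ [X]_J - \sum_{k=0}^{J-1} [Z]_k \cdot [Y]_{J-k} \}$, $J = 0, 1, \ldots$

d) $Z = X^I$, $I$ integer

1) $I > 0$, $I = \sum_{s=0}^{n} l_s 2^s$, $l_s \in \{0, 1\}$:

$[Z]_J = [ \prod_{s=0}^{n} X^{l_s 2^s} ]_J$, $J = 0, 1, \ldots$

2) $I = 0$: $Z = 1$

3) $I < 0$: $[Z]_J = [1 / X^{-I}]_J$, $J = 0, 1, \ldots$

e) $Z = X^a$, $a$ a real constant

$[Z]_J = \frac{1}{[X]_0} \sum_{k=0}^{J-1} \frac{a(J-k) - k}{J} \, [Z]_k \cdot [X]_{J-k}$, $J = 1, 2, \ldots$

f) $Z = X^Y$

$[Z]_J = [\mathrm{EXP}(Y \cdot \mathrm{LOG}(X))]_J$, $J = 0, 1, \ldots$

g) $Z = \mathrm{LOG}(X)$

$[Z]_1 = [X]_1 / [X]_0$

$[Z]_J = \frac{1}{[X]_0} \{ [X]_J - \sum_{k=1}^{J-1} \frac{J-k}{J} \, [X]_k \cdot [Z]_{J-k} \}$, $J = 2, 3, \ldots$

h) $Z = \mathrm{EXP}(X)$

$[Z]_J = \sum_{k=0}^{J-1} \frac{J-k}{J} \, [Z]_k \cdot [X]_{J-k}$, $J = 1, 2, \ldots$

i) $Z = \mathrm{SIN}(X)$, $W = \mathrm{COS}(X)$

$[Z]_J = \sum_{k=0}^{J-1} \frac{J-k}{J} \, [W]_k \cdot [X]_{J-k}$, $J = 1, 2, \ldots$

$[W]_J = -\sum_{k=0}^{J-1} \frac{J-k}{J} \, [Z]_k \cdot [X]_{J-k}$, $J = 1, 2, \ldots$

j) $Z = \mathrm{SINH}(X)$, $W = \mathrm{COSH}(X)$

$[Z]_J = \sum_{k=0}^{J-1} \frac{J-k}{J} \, [W]_k \cdot [X]_{J-k}$, $J = 1, 2, \ldots$

$[W]_J = \sum_{k=0}^{J-1} \frac{J-k}{J} \, [Z]_k \cdot [X]_{J-k}$, $J = 1, 2, \ldots$

k) $Z = \mathrm{ATAN}(X)$

Let $V = X^2 + 1$, $W = 1/V$.

$[Z]_J = \sum_{k=0}^{J-1} \frac{J-k}{J} \, [W]_k \cdot [X]_{J-k}$, $J = 1, 2, \ldots$

l) $Z = \mathrm{ASIN}(X)$, $Y = \mathrm{ACOS}(X)$

Let $V = 1 - X^2$, $W = 1/\sqrt{V}$.

$[Z]_J = \sum_{k=0}^{J-1} \frac{J-k}{J} \, [W]_k \cdot [X]_{J-k}$, $J = 1, 2, \ldots$

$[Y]_J = -\sum_{k=0}^{J-1} \frac{J-k}{J} \, [W]_k \cdot [X]_{J-k}$, $J = 1, 2, \ldots$

m) $\mathrm{TAN}(X) = \mathrm{SIN}(X) / \mathrm{COS}(X)$

$\sqrt{X} = X^{1/2}$

$\mathrm{TANH}(X) = \mathrm{SINH}(X) / \mathrm{COSH}(X)$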
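Several of the recurrence relations can be tried out directly. The Python sketch below is an illustration only: the function names (`t_mul`, `t_div`, `t_pow`, `t_log`, `t_exp`) are hypothetical, lists of normalized coefficients stand in for TAYLOR variables, and the code follows relations b), c), e), g) and h) as written above.

```python
import math

def t_mul(x, y):                       # relation b)
    return [sum(x[k] * y[j - k] for k in range(j + 1)) for j in range(len(x))]

def t_div(x, y):                       # relation c)
    z = []
    for j in range(len(x)):
        z.append((x[j] - sum(z[k] * y[j - k] for k in range(j))) / y[0])
    return z

def t_pow(x, a):                       # relation e), a a real constant
    z = [x[0] ** a]
    for j in range(1, len(x)):
        s = sum((a * (j - k) - k) * z[k] * x[j - k] for k in range(j))
        z.append(s / (j * x[0]))
    return z

def t_log(x):                          # relation g)
    z = [math.log(x[0])]
    for j in range(1, len(x)):
        s = sum((j - k) * x[k] * z[j - k] for k in range(1, j)) / j
        z.append((x[j] - s) / x[0])
    return z

def t_exp(x):                          # relation h)
    z = [math.exp(x[0])]
    for j in range(1, len(x)):
        z.append(sum((j - k) * z[k] * x[j - k] for k in range(j)) / j)
    return z

one = [1.0, 0.0, 0.0, 0.0, 0.0]
x = [1.0, 1.0, 0.0, 0.0, 0.0]          # the series of 1 + t about t = 0
print(t_div(one, x))                   # 1/(1 + t)
print(t_pow(x, 0.5))                   # sqrt(1 + t)
print(t_log(x))                        # log(1 + t)
print(t_exp(t_log(x)))                 # should reproduce the series of 1 + t
```

The first three results can be compared with the familiar geometric, binomial and logarithmic series; the last line is a consistency check between relations g) and h).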


Appendix B


Example 

The following example was taken from a homework assignment given in
an optimization class at the University of Wisconsin. This example is simple
but quite typical of problems that arise in applications. We have to compute
the gradient of a function which can be easily described by a computer program,
but whose explicit expression is quite hard to obtain. When one tries to
compute the gradient numerically, one runs into convergence problems. In this
example we show how easy it is to get the gradient of such a function by using the
GRADIENT package.

The Problem

We have a missile on the north pole of a ball of radius 1 and we want
to fly to the south pole. (The units are chosen in such a way that all the
constants are 1.) Because the problem is symmetric we only have to solve a two
dimensional problem.


The motion equations are:

$\frac{d^2 r}{dt^2} = -\frac{r}{|r|^3} + u$, $r = (x, y)$.

Let $t = \alpha\tau$; then

$\frac{d}{d\tau} (x, y, \dot{x}, \dot{y})^T = \alpha \, (\dot{x}, \dot{y}, -x/|r|^3 + u_1, -y/|r|^3 + u_2)^T = \alpha f(x, y, \dot{x}, \dot{y}, u_1, u_2)$.

Discretize by the Euler method:

$(x_{k+1}, y_{k+1}, \dot{x}_{k+1}, \dot{y}_{k+1})^T = (x_k, y_k, \dot{x}_k, \dot{y}_k)^T + h \alpha f(x_k, y_k, \dot{x}_k, \dot{y}_k, u_{1,k}, u_{2,k})$, $k = 0, 1, \ldots, 9$, $h = 0.1$.

Finally, solve:

$\min_{(\alpha, u_{1,0}, u_{2,0}, u_{1,8}, u_{2,8}, u_{1,9}, u_{2,9})} \{ 20 [ (x_{10})^2 + (y_{10} + 1)^2 + (\dot{x}_{10})^2 + (\dot{y}_{10})^2 ] + (u_{1,0})^2 + (u_{2,0})^2 + (u_{1,8})^2 + (u_{2,8})^2 + (u_{1,9})^2 + (u_{2,9})^2 + \frac{\alpha^2}{10} + \frac{20}{x_5^2 + y_5^2} \}$.

$u_{1,j} = 0$, $u_{2,j} = 0$ for $1 \le j < 8$.

Use the variable metric algorithm to solve the above minimization problem.

1. INPUT DECK


@XQT MRC*LIB.AUGMENT
@ADD GK*DIFF.DESC-GRD/HEAD
 8
@ADD GK*DIFF.DESC-GRD/BODY
*BEGIN
@FOR,IS TEST1
      IMPLICIT GRADIENT (A-H,O-Z)
      GRADIENT FUNCTION FLTFN(ALFA,U0,U8,U9)
      DIMENSION U0(2),U8(2),U9(2),U(2,10),X(11),Y(11),VY(11),VX(11)
      X(1)=0.0
      Y(1)=1.0
      VX(1)=0.0
      VY(1)=0.0
      H=.1
      DO 10 I=1,10
      U(1,I)=0.0
      U(2,I)=0.0
   10 CONTINUE
      U(1,1)=U0(1)
      U(2,1)=U0(2)
      U(1,9)=U8(1)
      U(2,9)=U8(2)
      U(1,10)=U9(1)
      U(2,10)=U9(2)
      DO 20 I=1,10
      RS=(X(I)**2+Y(I)**2)**1.5
      X(I+1)=X(I)+H*ALFA*VX(I)
      Y(I+1)=Y(I)+H*ALFA*VY(I)
      VX(I+1)=VX(I)+H*ALFA*(U(1,I)-X(I)/RS)
      VY(I+1)=VY(I)+H*ALFA*(U(2,I)-Y(I)/RS)
   20 CONTINUE
      FLTFN=20.*(X(11)**2+(Y(11)+1)**2+VX(11)**2+VY(11)**2)
     1 +U0(1)**2+U0(2)**2+U8(1)**2+U8(2)**2+U9(1)**2+U9(2)**2
     1 +(ALFA**2)/10.0+20.0/(X(6)**2+Y(6)**2)
      RETURN
      END
*END

Remarks.


1) @XQT MRC*LIB.AUGMENT starts the execution of AUGMENT.

2) @ADD GK*DIFF.DESC-GRD/HEAD adds the card image of the HEAD part of
the description deck into the run stream.
Similarly @ADD GK*DIFF.DESC-GRD/BODY adds the BODY part.

3) The function is translated into a subroutine. The function and gradient values
are stored in the last argument of the subroutine.

4) The translated subroutine can be used to compute function and gradient values
or function values alone. (See Order of Differentiation.)

5) The rest of the computation details are omitted.
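For readers without access to AUGMENT, the computation carried out by the translated subroutine can be imitated in a few lines. The following Python sketch is not the GRADIENT package: the class `Grad`, the helper `seeded` and the evaluation point are assumptions made for the illustration. Each quantity carries its value and its 7 partial derivatives (8 words in all, the number inserted in the description deck), and the function body mirrors the FORTRAN source of FLTFN.

```python
N_VARS = 7   # ALFA, U0(1), U0(2), U8(1), U8(2), U9(1), U9(2)

class Grad:
    """Value plus N_VARS partial derivatives, like an (N+1)-word GRADIENT variable."""

    def __init__(self, value, partials=None):
        self.v = float(value)
        self.d = list(partials) if partials is not None else [0.0] * N_VARS

    @staticmethod
    def lift(x):
        return x if isinstance(x, Grad) else Grad(x)

    def __add__(self, o):
        o = Grad.lift(o)
        return Grad(self.v + o.v, [a + b for a, b in zip(self.d, o.d)])
    __radd__ = __add__

    def __sub__(self, o):
        o = Grad.lift(o)
        return Grad(self.v - o.v, [a - b for a, b in zip(self.d, o.d)])

    def __mul__(self, o):
        o = Grad.lift(o)
        return Grad(self.v * o.v, [a * o.v + self.v * b for a, b in zip(self.d, o.d)])
    __rmul__ = __mul__

    def __truediv__(self, o):
        o = Grad.lift(o)
        q = self.v / o.v
        return Grad(q, [(a - q * b) / o.v for a, b in zip(self.d, o.d)])

    def __rtruediv__(self, o):
        return Grad.lift(o) / self

    def __pow__(self, p):            # p is a plain number, as in X**2 or X**1.5
        return Grad(self.v ** p, [p * self.v ** (p - 1) * a for a in self.d])

def fltfn(alfa, u0, u8, u9):
    h = 0.1
    zero = Grad(0.0)
    u = [(zero, zero)] * 10          # thrust is zero except at steps 0, 8 and 9
    u[0], u[8], u[9] = tuple(u0), tuple(u8), tuple(u9)
    x, y, vx, vy = Grad(0.0), Grad(1.0), Grad(0.0), Grad(0.0)
    for i in range(10):
        if i == 5:
            x5, y5 = x, y            # X(6), Y(6) in the FORTRAN source
        rs = (x * x + y * y) ** 1.5
        x, y, vx, vy = (x + h * alfa * vx,
                        y + h * alfa * vy,
                        vx + h * alfa * (u[i][0] - x / rs),
                        vy + h * alfa * (u[i][1] - y / rs))
    controls = list(u0) + list(u8) + list(u9)
    penalty = sum((c ** 2 for c in controls), Grad(0.0))
    return (20.0 * (x ** 2 + (y + 1.0) ** 2 + vx ** 2 + vy ** 2)
            + penalty + alfa ** 2 / 10.0 + 20.0 / (x5 ** 2 + y5 ** 2))

def seeded(values):
    """Seed each independent variable with a unit partial with respect to itself."""
    return [Grad(v, [1.0 if j == i else 0.0 for j in range(N_VARS)])
            for i, v in enumerate(values)]

point = [0.5, 0.1, -0.2, 0.05, 0.1, 0.1, -0.05]   # an arbitrary evaluation point
p = seeded(point)
result = fltfn(p[0], p[1:3], p[3:5], p[5:7])
print(result.v)
print(result.d)
```

The seeding step corresponds to the remark that the initial values of the derivatives of the input variables have to be provided too.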


OUTPUT FROM AUGMENT

      SUBROUTINE FLTFN (ALFA,U0,U8,U9,GRDRLT)
C     ===== PROCESSED BY AUGMENT, VERSION 4H =====
C     TEMPORARY STORAGE LOCATIONS
C     GRADIENT
      REAL GRDTMP(8,3)
C     LOCAL VARIABLES
      INTEGER I
C     GRADIENT
      REAL H(8), RS(8), U(8,2,10), VX(8,11), VY(8,11), X(8,11), Y(8,11),
     * GRDRES(8)
C     GLOBAL VARIABLES
C     GRADIENT
      REAL ALFA(8), U0(8,2), U8(8,2), U9(8,2), GRDRLT(8)
C     ===== TRANSLATED PROGRAM =====
      CALL GRDCER (0.0,X(1,1))
      CALL GRDCER (1.0,Y(1,1))
      CALL GRDCER (0.0,VX(1,1))
      CALL GRDCER (0.0,VY(1,1))
      CALL GRDCER (.1,H)
      DO 10 I=1,10
      CALL GRDCER (0.0,U(1,1,I))
      CALL GRDCER (0.0,U(1,2,I))
   10 CONTINUE
      CALL GRDCPY (U0(1,1),U(1,1,1))
      CALL GRDCPY (U0(1,2),U(1,2,1))
      CALL GRDCPY (U8(1,1),U(1,1,9))
      CALL GRDCPY (U8(1,2),U(1,2,9))
      CALL GRDCPY (U9(1,1),U(1,1,10))
      CALL GRDCPY (U9(1,2),U(1,2,10))
      DO 20 I=1,10
      CALL GRDEVI (X(1,I),2,GRDTMP(1,1))
      CALL GRDEVI (Y(1,I),2,GRDTMP(1,2))
      CALL GRDAVV (GRDTMP(1,1),GRDTMP(1,2),GRDTMP(1,2))
      CALL GRDEVR (GRDTMP(1,2),1.5,RS)
      CALL GRDMVV (H,ALFA,GRDTMP(1,1))
      CALL GRDMVV (GRDTMP(1,1),VX(1,I),GRDTMP(1,1))
      CALL GRDAVV (X(1,I),GRDTMP(1,1),X(1,I+1))
      CALL GRDMVV (H,ALFA,GRDTMP(1,1))
      CALL GRDMVV (GRDTMP(1,1),VY(1,I),GRDTMP(1,1))
      CALL GRDAVV (Y(1,I),GRDTMP(1,1),Y(1,I+1))
      CALL GRDMVV (H,ALFA,GRDTMP(1,1))
      CALL GRDDVV (X(1,I),RS,GRDTMP(1,2))
      CALL GRDSVV (U(1,1,I),GRDTMP(1,2),GRDTMP(1,2))
      CALL GRDMVV (GRDTMP(1,1),GRDTMP(1,2),GRDTMP(1,2))
      CALL GRDAVV (VX(1,I),GRDTMP(1,2),VX(1,I+1))
      CALL GRDMVV (H,ALFA,GRDTMP(1,1))
      CALL GRDDVV (Y(1,I),RS,GRDTMP(1,2))
      CALL GRDSVV (U(1,2,I),GRDTMP(1,2),GRDTMP(1,2))
      CALL GRDMVV (GRDTMP(1,1),GRDTMP(1,2),GRDTMP(1,2))
      CALL GRDAVV (VY(1,I),GRDTMP(1,2),VY(1,I+1))
   20 CONTINUE
      CALL GRDEVI (X(1,11),2,GRDTMP(1,1))
      CALL GRDAVI (Y(1,11),1,GRDTMP(1,2))
      CALL GRDEVI (GRDTMP(1,2),2,GRDTMP(1,2))

CALI  G Ft  PA  00  (GRPTMP  ( 1 , 1 ) » GRPTMP  ( 1,2)  , GFiPTMP  ( 1 ,2) ) 
C AL L GRPEO I ( OX ( 1 , 1 1 ) , 2 , GRPTMP (1,1)) 

CAL  l GRP A 00  ( GRPTMP (1,2), GRPTMP (1,1), GRPTMP (1,1)) 
CALL  GRPEOI  ( OY ( 1 , 1 1 ) , 2 , GRDTMP (1,2)) 

CALL  GRP A 00  ( GRPTMP (1,1), GRDTMP ( l . 2 ) , GRPTMP (1,2)) 
CALL  GRP MR 0 ( 20 . , GRPTMP ( 1 , 2) , GRPTMP ( 1 , 2 ) ) 

CALL  GRPEOI  ( UO ( 1 , 1 >, 2 , GRDTMP ( 1 » 1 > > 

CALI  GRP A 00  C GRPTMP (1,2), GRPTMP (1,1), GRPTMP (1,1)) 
CA1  L GRPEOI  ( UO ( 1 , 2 ) ,2 , GRPTMP (1,2)) 

CALL  GRP AO 0 ( GRPTMP (1,1), GRDTMP (1,2), GRPTMP (1,2)) 
CALL  GRPEOI  (IJS(  1 , 1 ) ,2, GRPTMP ( 1 , 1 ) ) 

CALL  GRP A 00  < GRP1 MP (1,2), GRPTMP (1,1), GRPTMP (1,1)) 
CALL  GRPEOI  ( IJ8(  1 , 2 ) , 2 , GRPTMP  ( 1 , 2 ) ) 

CALL  GRPAOO  ( GRPTMP (1,1), GRDTMP (1,2), GRPTMP (1,2)) 
CALL  GRPEOI  ( U9 ( 1 , 1 ), 2 , GRDTMP ( 1 , 1 ) > 

CAI  L GRPAOO  < GRP1 MP (1,2), GRPTMP (1,1), GRPTMP (1,1)) 
CAl ! GRPEOI  (1 19  (1.  ,2), 2,  GRPTMP  (1,2)) 

CALI  GRPAOO  ( GRPTMP (1,1), GRPTMP (1,2), GRPTMP (1,2)) 
CAI  L GRPEOI  ( ALFA  , 2 , GRPTMP (1,1)) 

CAI  L G Ft  DP  OR  ( GRPI  MP  ( 1 , 1 ) , 10. 0 , GRPTMP  (1,1)) 

CAI  1 GRPAOO  ( GRPTMP (1,2), GRPTMP (1,1), GRDTMP (1,1)) 
CAL  L GRPEOI  ( X U ,6 ) , 2 , GRPI MP (1,2)) 

CAI  I GRPEO I ( Y( 1 , A ) , 2 , GRDTMP (1,3)) 

CAI  L GRPAOO  ( GRPTMP (1,2), GRPTMP (1,3), GRPTMP ( 1 , 3 ) > 
CAL L GRPPRO  (20.0, GRPI MP (1,3), GRDTMP (1,3)) 

CALL  GRPAOO  ( GRPTMP  (.1,1),  GRPTMP  (1,3),  G Ft  PRES ) 

GO  TO  30000 

C RETURN  COPE 

30000  CONTINUE 

CALI  GRPCPY  ( GFtPItES , GRPRL T ) 

RETURN 

END 
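
The translated listing replaces each arithmetic operation by a call that carries a value together with its gradient (each GRADIENT variable holds 8 words, presumably one value and seven partial derivatives). The following is a minimal Python sketch of that idea, not the AUGMENT library itself: the class `Grad` and the helpers `add`, `mul`, `div`, and `powr` are hypothetical names chosen for illustration, and their correspondence to GRDAVV, GRDMVV, GRDDVV, and GRDEVI/GRDEVR is an assumption based on the calling pattern above.

```python
class Grad:
    """A value plus its gradient vector (hypothetical analogue of a GRADIENT variable)."""

    def __init__(self, value, grad):
        self.value = float(value)
        self.grad = list(grad)

    @classmethod
    def variable(cls, value, index, n):
        # Independent variable number `index` out of n: unit gradient component.
        g = [0.0] * n
        g[index] = 1.0
        return cls(value, g)


def add(a, b):
    # Sum rule (assumed counterpart of GRDAVV).
    return Grad(a.value + b.value, [x + y for x, y in zip(a.grad, b.grad)])


def mul(a, b):
    # Product rule (assumed counterpart of GRDMVV).
    return Grad(a.value * b.value,
                [a.value * y + b.value * x for x, y in zip(a.grad, b.grad)])


def div(a, b):
    # Quotient rule (assumed counterpart of GRDDVV).
    q = a.value / b.value
    return Grad(q, [(x - q * y) / b.value for x, y in zip(a.grad, b.grad)])


def powr(a, p):
    # Power with a constant exponent (assumed counterpart of GRDEVI / GRDEVR).
    return Grad(a.value ** p, [p * a.value ** (p - 1) * x for x in a.grad])


# RS = (X**2 + Y**2)**1.5 and X/RS, evaluated with gradients at X = 3, Y = 4.
x = Grad.variable(3.0, 0, 2)
y = Grad.variable(4.0, 1, 2)
rs = powr(add(powr(x, 2), powr(y, 2)), 1.5)
x_over_rs = div(x, rs)
```

Each call mirrors one line of the translated loop body, which is why the precompiler output is so much longer than the source it was generated from.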
Appendix C

Deck Arrangement†

The data deck read by AUGMENT has the structure shown in the
following diagram:

        +--------------------+
        | *END               |
        +--------------------+
        | Source Deck        |
        +--------------------+
        | *BEGIN             |
        +--------------------+
        | Description Deck   |
        +--------------------+

At the conclusion of processing, the translated program decks
are in the output file in 80-column card image format.

†This page is taken from: The AUGMENT Precompiler I. User Information,
Fred Crary [3].
PROGRESS IN THE CALCULATION OF TWO- AND THREE-DIMENSIONAL
BOUNDARY LAYERS


Tuncer  Cebeci 

Mechanical  Engineering  Department 
California  State  University  at  Long  Beach 
Long  Beach,  California  90846 


1.  INTRODUCTION.  Boundary layers, sometimes called thin shear layers [1],
occur  in  flow  past  bodies  (external  flows),  or  flow  through  ducts  or  channels 
(internal  flows),  at  all  but  very  low  Reynolds  numbers.  The  major  portion  of 
the  flow  can  be  considered  inviscid  since  the  magnitude  of  the  viscous  force 
is  negligible.  However,  owing  to  the  need  to  satisfy  the  no-slip  boundary 
condition  imposed  by  viscous  flow,  the  fluid  must  have  a rapid  variation  from 
the  inviscid  velocity  to  the  zero  value  at  the  wall.  The  region  where  this 
variation  occurs  is  restricted  to  a thin  layer  close  to  the  wall,  of  an  extent 
proportional  to  an  inverse  power  of  the  Reynolds  number,  and  is  called  the 
boundary  layer. 

The  presence  of  the  boundary  layer  on  a body  allows  the  computation  of 
quantities  of  great  importance  in  the  design  of  vehicles,  for  example,  viscous 
drag and heat transfer. The phenomenon of flow separation is intimately
connected with the processes occurring within the boundary layer where the already
retarded flow first reverses direction. Although all of the desired results
may be obtained from a direct solution of the Navier-Stokes equations, these
solutions are complicated and very costly. They are also unnecessary since
at  high  Reynolds  numbers  the  equations  can  be  simplified  to  give  the  boundary- 
layer  equations,  which  are  extremely  versatile,  and  encompass  a very  wide 
range  of  flow  situations. 

Since  the  advent  of  the  use  of  finite-difference  methods  for  the  solution 
of  the  boundary-layer  equations,  great  advances  have  been  made  in  both  solution 
techniques,  and  basic  understanding  of  the  flows.  Without  the  use  of  the 
computer,  our  knowledge,  especially  in  the  area  of  turbulent  flows,  would  be 
considerably  less  than  it  is  today.  Most  of  the  previous  work  in  the  field 
has been concentrated on either two-dimensional or axisymmetric flows. Only
recently  has  there  been  interest  in  three-dimensional  or  unsteady  boundary 
layers,  and  to  date,  not  much  has  been  done. 

The  numerical  methods  used  for  all  these  solutions  fall  into  three  basic 
categories.  Of  historical  interest  are  the  difference-differential,  or  line, 
methods  which  utilize  shooting  techniques  for  their  solution.  These  are  not 
used much anymore. Of the two true finite-difference methods, both are implicit,
with the major distinction being in the number of node points used to form the
differences on grid lines, and both can difference across node points. The Box
scheme, devised by H. B. Keller [2], uses only two-point differences, thus
differencing only between grid points, where the difference formulas are also
centered. Both of these schemes are in widespread use today, but I will
concentrate only on the Box scheme, which H. B. Keller and I have used with
success over the past ten years on a great variety of problems; see, for
example, refs. 3 to 6.
As I previously mentioned, standard two-dimensional boundary-layer problems
have been well documented, so I will report on two areas of current interest
which still leave many open questions. First, the calculation of separating
and reverse two-dimensional flows using the boundary-layer equations will be
discussed and comparisons with some Navier-Stokes solutions presented. Next,
the three-dimensional flow past wings will be described and the related
mathematical problem of unsteady two-dimensional boundary-layer flow will be
presented. Finally, the various models of turbulence will be briefly reviewed
and an example of the use of the model we prefer at Douglas Aircraft Company,
Long Beach, will be given, along with some flows where more complex models
are called for.


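2.  SEPARATING AND REVERSE FLOWS.  The boundary-layer equations for
two-dimensional incompressible laminar and turbulent flows with the eddy
viscosity concept can be written as [1]

$$\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = 0 \tag{1}$$

$$\underline{u\frac{\partial u}{\partial x}} + v\frac{\partial u}{\partial y} = -\frac{1}{\rho}\frac{dp}{dx} + \underline{\frac{\partial}{\partial y}\left(b\frac{\partial u}{\partial y}\right)} \tag{2}$$

$$b = \nu + \epsilon_m \tag{3}$$

Here the eddy viscosity $\epsilon_m$ is defined by

$$-\rho\,\overline{u'v'} = \rho\,\epsilon_m \frac{\partial u}{\partial y} \tag{4a}$$

and the pressure is related to the velocity at the edge of the boundary layer
by

$$-\frac{1}{\rho}\frac{dp}{dx} = u_e\frac{du_e}{dx} \tag{4b}$$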
Equations (1) to (3) are subject to the following boundary conditions:

$$y = 0:\ u = 0,\ v = v_w(x); \qquad y = y_e:\ u = u_e(x) \tag{5}$$

The boundary-layer problem as formulated by (1) to (5), where the external
velocity (pressure gradient) is prescribed, is termed the standard problem.
Once this is solved, quantities of importance can be obtained from the solution
such as skin friction

$$c_f = \frac{\tau_w}{\tfrac{1}{2}\rho u_e^2} \tag{6a}$$

or displacement thickness,

$$\delta^* = \int_0^{y_e}\left(1 - \frac{u}{u_e}\right)dy \tag{6b}$$
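
As an illustration of how the quantities in (6a) and (6b) might be evaluated from a computed profile, here is a minimal Python sketch. It assumes the usual definitions, c_f = tau_w / (rho u_e^2 / 2) and delta* as the integral of (1 - u/u_e) across the layer, and uses a simple trapezoidal rule; the function names and the linear test profile are hypothetical and are not taken from the paper.

```python
def displacement_thickness(y, u, u_e):
    """Trapezoidal estimate of delta* = integral over y of (1 - u/u_e)."""
    total = 0.0
    for k in range(1, len(y)):
        f_prev = 1.0 - u[k - 1] / u_e
        f_curr = 1.0 - u[k] / u_e
        total += 0.5 * (f_prev + f_curr) * (y[k] - y[k - 1])
    return total


def skin_friction(tau_w, rho, u_e):
    """Local skin-friction coefficient c_f = tau_w / (0.5 * rho * u_e**2)."""
    return tau_w / (0.5 * rho * u_e ** 2)


# Hypothetical linear profile u = u_e * y / y_e, for which delta* = y_e / 2.
n, y_e, u_e = 50, 2.0, 10.0
y = [y_e * k / n for k in range(n + 1)]
u = [u_e * yk / y_e for yk in y]
delta_star = displacement_thickness(y, u, u_e)
c_f = skin_friction(tau_w=0.5, rho=1.0, u_e=u_e)
```

The linear profile is chosen only because the trapezoidal rule integrates it exactly, which makes the sketch easy to check by hand.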
It has been found that for the standard problem, as separation is approached, a
singularity develops in the solution and the procedure cannot pass through
separation into the region of separated (reverse) flow [7,8]. Thus, prior to
dealing with the problem of integrating into a negative velocity, a means must
be found to enable one to pass through the separation singularity in order to
reach this reverse flow region.

The dominant terms in (2) are underlined above, and thus the equation is
classified as parabolic. The numerical solution of this equation together with
(1) and (5) requires a marching procedure. At stations A and E of Figure 1, the
velocity, u, is uniformly positive across the boundary layer, and no
difficulties are encountered in the solution procedure. However, at stations B and
C, the velocity has a reverse flow in a portion of the boundary layer, and a
marching procedure from B to C cannot be used since the information is being
propagated from C to B against the proposed direction of the march. The
numerical method will blow up, and indeed the differential equation has
exponentially growing solutions if marched into a region of negative flow.
Consequently, something must be done if a solution is to be obtained in the
region of reverse flow.

Figure 1. Flow with a "separation bubble." Flow separates between A and
B and reattaches at D.


This difficulty can be avoided by using the so-called inverse procedures
[5-10]. Instead of imposing the pressure gradient as known (which implies
$u_e$ known), some other quantity, such as $c_f$ or $\delta^*$, is assumed given and
the correct pressure gradient, consistent with this quantity and the governing
equations, is deduced. As a first step, rewrite the equations in terms of the
stream function $\psi$, which satisfies the continuity equation,

$$u = \frac{\partial\psi}{\partial y}, \qquad v = -\frac{\partial\psi}{\partial x} \tag{7}$$

The momentum equation (2) becomes

$$(b\psi'')' - \frac{1}{\rho}\frac{dp}{dx} = \psi'\frac{\partial\psi'}{\partial x} - \psi''\frac{\partial\psi}{\partial x} \tag{8}$$
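
The step from the momentum equation to its stream-function form can be checked term by term. The short derivation below is a sketch under the stated assumptions: the stream function is defined by $u = \partial\psi/\partial y$, $v = -\partial\psi/\partial x$, a prime denotes $\partial/\partial y$, and the momentum equation has the form $u\,\partial u/\partial x + v\,\partial u/\partial y = -(1/\rho)\,dp/dx + \partial(b\,\partial u/\partial y)/\partial y$.

$$\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = \frac{\partial^2\psi}{\partial x\,\partial y} - \frac{\partial^2\psi}{\partial y\,\partial x} = 0$$

$$u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} = \psi'\frac{\partial\psi'}{\partial x} - \frac{\partial\psi}{\partial x}\,\psi'', \qquad \frac{\partial}{\partial y}\left(b\frac{\partial u}{\partial y}\right) = (b\psi'')'$$

$$\Rightarrow\quad (b\psi'')' - \frac{1}{\rho}\frac{dp}{dx} = \psi'\frac{\partial\psi'}{\partial x} - \psi''\frac{\partial\psi}{\partial x}$$

Continuity is thus satisfied identically, and only the momentum equation remains to be solved for $\psi$.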
Here the primes denote differentiation with respect to y and the parameter
b is defined by

$$b = \nu + \epsilon_m \tag{9}$$

The boundary conditions (5) become

$$y = 0:\ \psi = \psi' = 0; \qquad y = y_e:\ \psi'_e = u_e(x) \tag{10a}$$

If $\delta^*$ is also given, then from (6b), at $y = y_e$ we can impose another
boundary condition on the system, namely,

$$\psi_e = u_e\,[y_e - \delta^*] \tag{10b}$$

The statement of the problem is to find $\psi(x,y)$ and its derivatives, and the
pressure gradient, subject to equation (8) and boundary conditions (10),
for a given $\delta^*(x)$.

Keller and I have used two approaches to solve the inverse boundary-layer
problem; see, for example, refs. 5, 6, 9, and 10. They are termed "the
nonlinear eigenvalue method" and "the Mechul* function method."

The procedure of the nonlinear eigenvalue method [5,10] is as follows.
For each x-location, assume a value of $p_x$, call it $\beta$. Now solve the standard
problem by the Box method, using the value of $u_e$ given by this $\beta$. From this
solution compute a value of the displacement thickness, call it $\delta^*_c$. Compare
this with the given value of $\delta^*$, i.e.,

$$\phi(\beta) = \delta^*_c - \delta^* \tag{11}$$

If $\beta$ were chosen correctly, then $\phi$ would be zero. If it is not zero to
within some small tolerance, a new, better guess for $\beta$ is required. This
is found by finding the variation of $\phi$ with $\beta$, and using a Newton procedure
to predict a new value for $\beta$:

$$\beta^{\nu+1} = \beta^{\nu} - \frac{\phi(\beta^{\nu})}{\dfrac{\partial\phi}{\partial\beta}(\beta^{\nu})} \tag{12}$$

Using (11), we can write

$$\frac{\partial\phi}{\partial\beta} = \frac{\partial\delta^*_c}{\partial\beta} \tag{13}$$

This requires the solution of a set of variational equations obtained by
differentiating the differenced equations of the standard problem with respect
to $\beta$. The iteration procedure is continued until convergence is obtained,
usually requiring no more than two or three iterations.
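
The iteration just described can be sketched in a few lines of Python. This is only an illustration of the Newton update on the scalar mismatch between computed and prescribed displacement thickness; it is not the Box-method implementation. The callable `standard_problem` is a hypothetical stand-in for a solver that returns the computed displacement thickness and its derivative with respect to the assumed pressure-gradient parameter, and the quadratic toy model is invented purely so the sketch runs.

```python
def newton_inverse(delta_star_given, standard_problem, beta0,
                   tol=1e-10, max_iter=20):
    """Find beta such that the computed delta* matches the prescribed one.

    standard_problem(beta) must return (delta_c, d_delta_c / d_beta).
    Returns the converged beta and the number of Newton updates taken.
    """
    beta = beta0
    for iteration in range(max_iter):
        delta_c, d_delta_d_beta = standard_problem(beta)
        phi = delta_c - delta_star_given      # mismatch to drive to zero
        if abs(phi) <= tol:
            return beta, iteration
        beta = beta - phi / d_delta_d_beta    # Newton update
    raise RuntimeError("Newton iteration did not converge")


def toy_standard_problem(beta):
    # Hypothetical smooth response standing in for a boundary-layer solve.
    delta_c = 2.0 - beta + 0.1 * beta ** 2
    return delta_c, -1.0 + 0.2 * beta


beta, iterations = newton_inverse(1.5, toy_standard_problem, beta0=0.0)
```

In a real calculation the derivative would come from the variational equations rather than from a closed-form expression, but the outer loop would have the same shape.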

The second approach, the Mechul-function method [6,9], uses the
momentum equation as previously given, eq. (8), but treats the pressure as an
unknown (hence the name of the method), and adds, as an additional equation
to close the set,

$$p' = 0 \tag{14}$$

The boundary conditions are specified as

$$y = 0:\ \psi = \psi' = 0 \tag{15a}$$

$$y = y_e:\ \psi = u_e(y_e - \delta^*), \quad \psi' = u_e \tag{15b}$$

The Box method using Newton's iteration is used to solve this set, which again
converges in two or three iterations at each x step.

Both of these methods have been used to solve various laminar and turbulent
boundary-layer problems where no separation is present. The results of each are
comparable. However, when a separation zone is encountered, and an attempt is
made to solve the equations, as below, only the Mechul function method is
successful. Thus, in what follows, it is only this method which has been modified
to pass through the separation point.

*Mechul: Turkish for unknown.

The major problem to be overcome in the separation region is the advance of
the solution into the oncoming reverse flow in the boundary layer. When this
velocity is small, as it usually is in limited separation bubbles, an
approximation made by Reyhner and Flügge-Lotz [11], called FLARE [12], can be used.
This consists of setting the streamwise convective term $u\,\partial u/\partial x$ equal to
zero wherever u < 0 within the boundary layer. Thus, the entire difficulty of
marching into an oncoming flow is eliminated. Incorporating this approximation
into the inverse procedures allows the computation of separated flows.
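
A minimal Python sketch of the FLARE idea follows. It assumes FLARE in the form commonly described in the literature, in which the streamwise convection term u du/dx is dropped at nodes where u < 0, so that nothing is carried against the marching direction. The function name and the sample profile are hypothetical.

```python
def flare_convection(u, dudx):
    """Streamwise convection term u * du/dx at each y-node.

    Where the local velocity is reversed (u < 0) the term is set to zero,
    which is the FLARE approximation as assumed in this sketch.
    """
    return [uk * dk if uk >= 0.0 else 0.0 for uk, dk in zip(u, dudx)]


# Hypothetical profile with a thin reversed-flow region next to the wall.
u = [-0.2, -0.05, 0.0, 0.3, 1.0]
dudx = [0.5, 0.4, 0.1, -0.2, 0.0]
convection = flare_convection(u, dudx)
```

The approximation is reasonable only while the reversed velocities are small, which is why it suits limited separation bubbles.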

The results of such calculations are given in Figure 2. Here a case
computed using the Navier-Stokes equations [13] was recomputed using Mechul with
FLARE [9]. The prescribed value of $\delta^*$ is shown, along with a comparison of
the computed values of $c_f$. The agreement is very good. As a
further demonstration of the ability of the method, another separated flow
previously computed by a boundary-layer technique [14] was recomputed using
Mechul-FLARE. The original calculations were also made with the FLARE
approximation, but using a modified set of equations for an inverse boundary
layer [14]. These are shown in Figure 3 labelled forward marching. A further
improvement over FLARE can be made once a complete pass through the separated
region is finished. With the computed velocities obtained from FLARE, the
convection terms involving u can be differenced with the prevailing wind of
the previous pass, and an iteration can be performed until the results do not
change [9,14]. These are shown labelled as global iteration in Figure 3, and
should represent the true flow more accurately since the physical situation
is modelled somewhat better than with FLARE. Results of our solution procedure are
also given, and it can be seen that the minimum shear agrees quite well with
the previous global iteration, with the results being generally between the two
previous calculations. I am presently investigating the use of this global
iteration procedure and other approaches within the framework of the Mechul
method. The results will be reported in a forthcoming report.

Figure 2. Calculated local skin-friction coefficient distribution for separating
and reattaching flow computed by Briley. The present method denotes
the solutions obtained by Mechul [9].

Figure 3. Comparison of calculated results for the flow computed by Carter.
The present method denotes the solutions obtained by Mechul [9].
(a) Local skin friction distribution, (b) Streamline pattern in
separation bubble.

3.  THREE-DIMENSIONAL AND UNSTEADY BOUNDARY LAYERS.  One of the most
important problems we face today is the calculation of three-dimensional
boundary layers. The major application of the methods we use is to the
calculation of three-dimensional compressible laminar and turbulent boundary
layers on arbitrary wings and bodies of revolution at incidence. One
obstacle that had to be overcome prior to other considerations, for the
calculation of boundary layers on wings, was the choice of a coordinate
system in which to perform the boundary-layer calculations. The one we
finally chose as being the most appropriate for wings is the body-oriented
nonorthogonal system shown in Figure 4. Using this system the entire wing
can easily be covered with grid points which follow the exact planform of the
wing [15]. Methods that use orthogonal systems are not very suitable for wings,
as discussed in ref. 15.

Figure 4. The nonorthogonal system used by Cebeci, Kaups and Ramsey for
wing calculations.

The governing boundary-layer equations for three-dimensional compressible
laminar and turbulent boundary layers for a nonorthogonal system can be written
as [15]:

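Continuity equation

$$\frac{\partial}{\partial x}(\rho u h_2 \sin\theta) + \frac{\partial}{\partial z}(\rho w h_1 \sin\theta) + \frac{\partial}{\partial y}(\overline{\rho v}\, h_1 h_2 \sin\theta) = 0 \tag{16}$$

x-Momentum equation

$$\rho\frac{u}{h_1}\frac{\partial u}{\partial x} + \rho\frac{w}{h_2}\frac{\partial u}{\partial z} + \overline{\rho v}\frac{\partial u}{\partial y} - \rho\cot\theta\, K_1 u^2 + \rho\csc\theta\, K_2 w^2 + \rho K_{12} u w = -\frac{\csc^2\theta}{h_1}\frac{\partial p}{\partial x} + \frac{\cot\theta\csc\theta}{h_2}\frac{\partial p}{\partial z} + \frac{\partial}{\partial y}\left(\mu\frac{\partial u}{\partial y} - \rho\,\overline{u'v'}\right) \tag{17}$$

z-Momentum equation

$$\rho\frac{u}{h_1}\frac{\partial w}{\partial x} + \rho\frac{w}{h_2}\frac{\partial w}{\partial z} + \overline{\rho v}\frac{\partial w}{\partial y} - \rho\cot\theta\, K_2 w^2 + \rho\csc\theta\, K_1 u^2 + \rho K_{21} u w = \frac{\cot\theta\csc\theta}{h_1}\frac{\partial p}{\partial x} - \frac{\csc^2\theta}{h_2}\frac{\partial p}{\partial z} + \frac{\partial}{\partial y}\left(\mu\frac{\partial w}{\partial y} - \rho\,\overline{v'w'}\right) \tag{18}$$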
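$$K_{12} = \frac{1}{\sin\theta}\left[-\left(K_1 + \frac{1}{h_1}\frac{\partial\theta}{\partial x}\right) + \cos\theta\left(K_2 + \frac{1}{h_2}\frac{\partial\theta}{\partial z}\right)\right] \tag{22a}$$

$$K_{21} = \frac{1}{\sin\theta}\left[-\left(K_2 + \frac{1}{h_2}\frac{\partial\theta}{\partial z}\right) + \cos\theta\left(K_1 + \frac{1}{h_1}\frac{\partial\theta}{\partial x}\right)\right] \tag{22b}$$

$u_t$ represents the total velocity within the boundary layer and is given by

$$u_t = (u^2 + w^2 + 2uw\cos\theta)^{1/2} \tag{22c}$$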
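
The total-velocity formula is the law of cosines for two velocity components measured along axes that meet at the angle theta. A small Python sketch, with a hypothetical function name and arbitrary sample values, shows the two limiting cases: orthogonal axes recover the familiar Pythagorean magnitude, and coincident axes simply add the components.

```python
import math


def total_velocity(u, w, theta):
    """Total velocity for components u, w along axes at angle theta (radians)."""
    return math.sqrt(u * u + w * w + 2.0 * u * w * math.cos(theta))


u_t_orthogonal = total_velocity(3.0, 4.0, math.pi / 2.0)   # axes at 90 degrees
u_t_parallel = total_velocity(3.0, 4.0, 0.0)               # axes coincide
u_t_swept = total_velocity(3.0, 4.0, math.radians(60.0))   # a nonorthogonal case
```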

The boundary conditions for the system of equations (16) to (19) are:

$$y = \delta:\quad u = u_e(x,z), \quad w = w_e(x,z), \quad H = H_e$$

In addition to closure assumptions for the Reynolds stresses $-\rho\,\overline{u'v'}$,
$-\rho\,\overline{v'w'}$ and $-\rho\,\overline{v'H'}$, equations (17) to (19) also require initial conditions
on two intersecting planes. One of these is the y-z plane along the
initial x-station where data is first prescribed, while the other is an
x-y plane away from which the subcharacteristics of the hyperbolic crossflow
allow one to march the equations. In fact, it is the presence of the
crossflow terms in the equations which distinguishes the calculation of
three-dimensional boundary layers from their two-dimensional counterparts.
The boundary-layer equations as given by (16) to (19) can be solved when
they are expressed either in physical coordinates or in transformed coordinates.
Each coordinate system has its own advantages. In three-dimensional flows, where
computer storage and time become quite important, the choice of using
transformed coordinates becomes necessary as well as convenient because the
transformed coordinates allow large steps to be taken in the streamwise and spanwise
directions. In addition, they remove the singularity at x = 0 and z = 0.

A convenient and useful transformation for three-dimensional boundary-layer
flows is given in ref. 15. According to this transformation, which is
a generalization of the 2-d Falkner-Skan transformation [1] to 3-d flows, we first
define the transformed coordinates by

$$x = x, \qquad z = z, \qquad d\eta = \left(\frac{u_e}{\rho_e\mu_e s_1}\right)^{1/2}\rho\,dy, \qquad s_1 = \int_0^x h_1\,dx \tag{24}$$

and introduce a two-component vector potential such that

$$\rho u h_2\sin\theta = \frac{\partial\psi}{\partial y}, \qquad \rho w h_1\sin\theta = \frac{\partial\phi}{\partial y}, \qquad \overline{\rho v}\,h_1 h_2\sin\theta = -\left(\frac{\partial\psi}{\partial x} + \frac{\partial\phi}{\partial z}\right) \tag{25}$$

In addition, we define dimensionless $\psi$ and $\phi$ by

$$\psi = (\rho_e\mu_e u_e s_1)^{1/2}\,h_2\sin\theta\, f(x,z,\eta), \qquad \phi = (\rho_e\mu_e u_e s_1)^{1/2}\,\frac{u_{ref}}{u_e}\,h_1\sin\theta\, g(x,z,\eta) \tag{26}$$
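
As a consistency check on the transformation, the short derivation below shows how the similarity variables relate to the velocity components. It is a sketch that assumes the forms $d\eta = (u_e/\rho_e\mu_e s_1)^{1/2}\rho\,dy$, $\rho u h_2\sin\theta = \partial\psi/\partial y$, $\rho w h_1\sin\theta = \partial\phi/\partial y$, $\psi = (\rho_e\mu_e u_e s_1)^{1/2} h_2\sin\theta\, f$ and $\phi = (\rho_e\mu_e u_e s_1)^{1/2}(u_{ref}/u_e)\,h_1\sin\theta\, g$, with a prime denoting $\partial/\partial\eta$.

$$\rho u h_2\sin\theta = \frac{\partial\psi}{\partial\eta}\frac{\partial\eta}{\partial y} = (\rho_e\mu_e u_e s_1)^{1/2} h_2\sin\theta\, f'\left(\frac{u_e}{\rho_e\mu_e s_1}\right)^{1/2}\rho = \rho\,u_e\,h_2\sin\theta\, f' \quad\Rightarrow\quad f' = \frac{u}{u_e}$$

$$\rho w h_1\sin\theta = \frac{\partial\phi}{\partial\eta}\frac{\partial\eta}{\partial y} = (\rho_e\mu_e u_e s_1)^{1/2}\frac{u_{ref}}{u_e} h_1\sin\theta\, g'\left(\frac{u_e}{\rho_e\mu_e s_1}\right)^{1/2}\rho = \rho\,u_{ref}\,h_1\sin\theta\, g' \quad\Rightarrow\quad g' = \frac{w}{u_{ref}}$$

Scaling the crossflow by a fixed reference velocity rather than by $w_e$ keeps $g'$ well defined where the spanwise edge velocity passes through zero.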


Using these transformations and the concepts of eddy viscosity and turbulent
Prandtl number, it can be shown that the x- and z-momentum equations, the energy
equation, and their boundary conditions in transformed variables can be written as:

x-Momentum

$$(bf'')' + m_1 f f'' - m_2 (f')^2 - m_5 f'g' + m_6 f''g - m_8 (g')^2 + m_{11} c = m_{10} f'\frac{\partial f'}{\partial x} + m_7 g'\frac{\partial f'}{\partial z} \tag{27}$$

z-Momentum

$$(bg'')' + m_1 f g'' - m_4 f'g' - m_3 (g')^2 + m_6 g g'' - m_9 (f')^2 + m_{12} c$$
Energy


* 


(ii-|E' )'  + ^2^'  + U3  _ m 

Boundary  Conditions: 
n = 0 : f 

n = n : f 
Here primes denote differentiation with respect to $\eta$ and

$$f' = \frac{u}{u_e}, \quad g' = \frac{w}{u_{ref}}, \quad b = C(1 + \epsilon_m^+), \quad C = \frac{\rho\mu}{\rho_e\mu_e}, \quad c = \frac{\rho_e}{\rho}, \quad E = \frac{H}{H_e}, \quad \epsilon_m^+ = \frac{\epsilon_m}{\nu} \tag{31}$$

The coefficients $m_1$ to $m_{12}$ are functions of the external velocity distribution,
the coordinate system and fluid properties as described in ref. 15. The formulas
for $\epsilon_m^+$ are also described in ref. 15.


To solve the system given by (27) to (30) by the Box method, we use the net
cube shown in Figure 5. Here the equations are centered at the midpoint of the
cube (see ref. 16). With the solution obtained along the stagnation line at the
leading edge, all that remains is the specification of the integration direction
in z. This cannot be done arbitrarily by stating that the integration will
always begin at the wing root, after the solution of some appropriate equations
there, and proceed towards the tip. It is possible, and, in fact, we have cases
where it was required, that the integration proceeds from tip to root [15]. I
will return to the details of this very important point later.

Figure 5. The net cube for three-dimensional flows.
For the initial work we did, using the net cube shown in Figure 5, the sign
of the external velocity, $w_e$, determined the direction of
integration. For $w_e > 0$ the procedure marched from root to tip; however,
if $w_e < 0$, the integration is from tip to root. The starting solution in
the plane of the tip is simply an approximation to the equations there.
One test case we calculated had [illegible] on the wing, see Figure 6.


Figure 6. Definitions of various regions on the wing used for marching procedure.
[Figure labels: root, tip, stagnation line, trailing edge; $w_e = (+)$ and
$w_e = (-)$; Region 1, Region 2, Region 3.]


As shown in this figure the direction of integration is out from root to tip,
then in, and finally out again. Using this very complicated logic, for the
external velocity shown in Figure 7, we obtained the solution of the governing
equations without any numerical difficulties. Figure 8 shows a comparison of
calculated and experimental results and Figure 9 shows the computed cross-flow
velocity profiles for this case. Since this particular wing was of a special
nature, the three-dimensional solution could be compared with an infinite
swept-wing solution obtained using an orthogonal system by simply using a coordinate
rotation. The agreement was excellent.
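
The direction logic described for this test case (outward, then inward, then outward again) amounts to splitting the spanwise stations into runs of constant sign of the edge crossflow velocity and sweeping each run in its own direction. The Python sketch below illustrates only that bookkeeping; the function name, the convention that index 0 is the root, and the sample values are hypothetical.

```python
def marching_regions(w_e):
    """Split spanwise stations into runs of constant sign of w_e.

    Stations are indexed from root (0) to tip. Each run is returned with the
    order in which its stations would be visited: root-to-tip where w_e >= 0,
    tip-to-root where w_e < 0.
    """
    regions = []
    start = 0
    for k in range(1, len(w_e) + 1):
        end_of_run = (k == len(w_e)) or ((w_e[k] >= 0.0) != (w_e[start] >= 0.0))
        if end_of_run:
            stations = list(range(start, k))
            if w_e[start] >= 0.0:
                regions.append(("root-to-tip", stations))
            else:
                regions.append(("tip-to-root", stations[::-1]))
            start = k
    return regions


# Hypothetical edge crossflow: positive, then negative, then positive again.
w_e = [0.3, 0.1, -0.2, -0.4, 0.2]
regions = marching_regions(w_e)
```

Even this small example shows why the author calls the logic complicated: every sign change adds a region with its own starting plane.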
Figure 7. Experimental velocity distribution for the data of Brebner and
Wyatt; nonorthogonal system.

Figure 8. Comparison of calculated results with experiment for the data of
Brebner and Wyatt. (a) Velocity profiles, $(u^2 + w^2)^{1/2}/u_\infty$.
(b) Cross-flow-angle distribution.

Figure 9. Computed cross-flow velocity profiles at different chordwise
stations for the data of Brebner and Wyatt, nonorthogonal system.

Numerous other test cases for wings were performed, including one for
which an isolated "island" of $w_e < 0$ existed on the wing, see Figure 10,
not connected to the root or the tip. We still covered the entire
wing surface, and generated good results. Nonetheless, we felt that our
method of determining the integration direction was inexact, and continued
to seek better alternatives. This was stimulated by our work on
another three-dimensional boundary-layer problem, the flow past a body of
revolution at incidence [16]. This configuration produces a region of reverse
crossflow, due to an adverse pressure gradient, but without a change of sign
of the crossflow edge velocity. Thus, the method used on the wing, where the
crossflow integration direction is changed when the edge velocity changes sign,
cannot be used. The attempt to integrate into an increasingly negative flow
ultimately caused the method to break down, apparently due to a violation of
the stability restriction of the difference scheme shown in Figure 5.

Figure 10. Finite wing. The symbols denote the stations where the
boundary-layer calculations are made. Dots correspond to stations where
$w_e$ is positive and x's correspond to negative $w_e$.

A possible solution to this crossflow problem can be found by the use of
the differencing shown in Figure 11, called zig-zag differencing. This allows
a greater degree of negative crossflow than our cube by virtue of a less
restrictive stability requirement [17]. Some of the previous wing calculations
were redone with this incorporated into the program, and the results were
basically the same as before, except that some small, unnoticed ripples in the
solution which had been present were eliminated. However, this seemed to be
just a [illegible] used, and did not really address
the physical process of the crossflow. To illustrate our current method,
consider a simplified form of equations (17) and (18).

Figure 11. Standard box and zig-zag differencing schemes for the unsteady
flow problem used by Cebeci [18]. For 3-D steady flows, t is
replaced by x, and x by z.
(32b)
The problem is that u and w continuously change through the boundary layer,
as  well  as  across  the  wing  at  the  boundary-layer  edge;  the  normal  velocity,  v, 
does  not  concern  us  here.  Somehow  these  changes  must  be  taken  into  account  so 
that  we  always  march  in  a direction  for  which  the  diffusion  problem  is  well 
posed.  With  this  in  mind,  we  can  look  at  the  equations  as  expressing  the 
variation  of  quantities  along  characteristic  directions: 
We approximate the equations by differencing backward along the characteristic
direction which exists at each different y-level, see Figure 12. Thus, this
method uses the notion of domains of dependence more carefully, and follows the
characteristics of the locally plane flow. Our calculations using this method
on the wing again have reproduced our original calculations without any
oscillations, and have greatly reduced the logic of the program, allowing the calculation
to proceed from root to tip at every x-station. We feel that this scheme
represents the most realistic and accurate way to compute three-dimensional
boundary layers on arbitrary wings.

Figure 12. The new differencing scheme for three-dimensional flows. The closed
symbols denote the computed solutions. The marching procedure is in
the spanwise direction for a given chordwise station.
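
The idea of differencing backward along the local characteristic can be illustrated on a much simpler model than the boundary-layer equations. The Python sketch below treats the pure convection problem u dq/dx + w dq/dz = 0 at a single y-level with u > 0, marching in x and taking the spanwise difference from whichever side the characteristic arrives. It is an explicit first-order upwind step chosen for brevity, not the implicit scheme of Figure 12, and all names and values are hypothetical.

```python
def characteristic_step(q, u, w, dx, dz):
    """Advance q(z) by one x-step for u*dq/dx + w*dq/dz = 0, with u > 0.

    At each spanwise node the z-difference is taken on the upstream side of
    the local characteristic dz/dx = w/u, so the sign of w selects the
    neighbour that lies inside the domain of dependence.
    """
    n = len(q)
    q_new = list(q)
    for k in range(n):
        c = w[k] * dx / (u[k] * dz)          # local characteristic slope in grid units
        if c >= 0.0 and k > 0:
            q_new[k] = q[k] - c * (q[k] - q[k - 1])      # information comes from the root side
        elif c < 0.0 and k < n - 1:
            q_new[k] = q[k] - c * (q[k + 1] - q[k])      # information comes from the tip side
    return q_new


# Hypothetical data: crossflow changes sign between the second and third nodes.
q = [0.0, 1.0, 2.0, 3.0, 4.0]
u = [1.0] * 5
w = [0.5, 0.5, -0.5, -0.5, -0.5]
q_next = characteristic_step(q, u, w, dx=1.0, dz=1.0)
```

Because the choice of neighbour is made node by node, the sweep itself can always run from root to tip, which is the simplification in program logic that the text describes.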

Another  area  where  these  same  considerations  arise  is  in  unsteady  two- 
dimensional  boundary  layers.  The  governing  equations  for  laminar  flow  are: 
They are subject to the boundary conditions

$$y = 0:\ u = v = 0; \qquad y = \delta:\ u = u_e(x,t) \tag{36}$$

and an appropriate initial condition.


I will  illustrate  the  properties  of  these  types  of  flows  by  considering 
the  flow  over  a circular  cylinder  started  inclusively  from  rest  [18].  In 
this  flow  at  certain  times,  the  streamwise  velocity  profile  contains  regions 
of  backflow.  Thus,  methods  used  for  two-dimensional  separated  flows,  and 
three-dimensional  flows  with  reverse  crossflow  should  be  applicable  here. 

The  flow  development  proceeds  as  follows.  For  some  time  after  the  impulsive 
start  at  t = 0,  all  the  flow  is  from  forward  to  rear  stagnation  point,  and 
so  the  skin  friction,  c.,  is  positive.  At  time  t = 0.32,  cf  becomes 
negative  at  an  interior  point  on  the  cylinder,  and  then  this  cl  = 0 point 
progressively  moves  upstream  until  it  finally  reaches  a value  of  x = 105° 
at  t = 1.25.  This  value  of  x is  quite  close  to  the  value  computed  by  a 
steady-state  analysis.  However,  the  calculation  assumes  the  flow  to  be 
unsteady,  and  models  the  vortex  shedding  phenomenon  in  some  sense.  Here 
the boundary-layer assumptions are no longer valid and the calculation finally
breaks  down. 

How  were  the  calculations,  on  which  this  description  is  based,  performed? 
I first  used  the  standard  Box  method,  with  the  net  cube  shown  in  Figure  5 used 
for  differencing,  to  calculate  the  flow  field  for  values  of  t < 0.8.  In  this 
way  I studied  how  far  one  can  compute  the  flow  field  with  and  without  backflow 
before  the  solutions  blow  up.  This  standard  procedure  was  able  to  calculate 
the  flow  field  for  all  x up  to  and  including  t = 0.6.  At  the  next  time 
interval  no  trouble  was  encountered  up  to  x = 152°.  Although  convergence 
was  obtained  at  the  next  two  x-stations,  the  asymptotic  behavior  at  the  edge 
of  the  boundary  layer  was  not  correct.  Finally,  at  x = 156°  the  solutions 
diverged.  With  increasing  time,  the  last  "good"  station  moved  forward  so 
that  it  was  at  x = 136°  at  t = 0.8.  Next  I used  the  zig-zag  scheme  of 
Figure  11  to  compute  the  flow.  For  all  practical  purposes,  the  results 
obtained  earlier  by  the  standard  Box  scheme  agreed  with  those  obtained  by  the 
new  procedure  except  now  the  solutions  did  not  break  down  at  those  x-stations 
previously  mentioned,  and  the  calculations  were  performed  up  to  x = 180°  for 
values  of  t up  to  0.80.  For  values  of  t > 0.80,  the  solutions  began  to 
develop  oscillations  in  regions  of  backflow.  To  remedy  this,  I used  the 
procedure  described  as  global  iteration  in  the  separated  flow  section,  and 
made  two  passes  in  x for  each  time  level,  differencing  with  the  wind  in 
regions  of  backflow.  In  this  way  the  calculations  were  performed  until 
t = 1.25,  and  values  of  skin  friction  and  displacement  thickness  as  well  as 
velocity  profiles  were  computed  without  any  difficulty.  Figure  13  shows  the 
computed  skin  friction  and  displacement  thickness  distributions  around  the 
circular  cylinder  for  various  values  of  t.  I am  presently  investigating 
better  procedures  for  this  flow  in  order  to  extend  the  calculations  to  large 
times, hopefully to $t = \infty$. The results will be reported in a forthcoming
report. 
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Figure  13.  Computed:  (a)  Local  skin-friction  coefficients,  and  (b)  Displacement 
thickness  distributions  for  the  unsteady  flow  problem  discussed 
in  [18]. 


4.  TURBULENCE  MODELLING.  I would  like  to  conclude  with  a brief  summary 
of  the  types  of  models  being  used  in  turbulent-flow  calculations.  In  general, 
the  prediction  methods  for  turbulent  boundary  layers  can  be  divided  into 
integral  and  differential  methods.  The  integral  methods  involve  the  integral 
parameters  of  the  boundary  layer  (e.g.,  displacement  thickness,  local  skin 
friction, shape factor, etc.). They avoid the complexity of solving the
boundary-layer equations in the partial-differential-equation form and instead
solve  a system  of  ordinary  differential  equations.  For  further  details,  see 
references  19,  20. 


Differential methods involve direct assumptions for the shear stress
$-\rho\overline{u'v'}$ and seek the solution of the governing equations in their
partial-differential-equation form. The assumption can either lead to an algebraic
relation between $-\rho\overline{u'v'}$ and $\partial u/\partial y$ or it can take the form of a
partial-differential equation. The methods that use the first assumption are often
called mean-velocity methods; those that use the second assumption are called
transport-equation methods.


In the mean-velocity methods, we observe two main assumptions. In one
assumption, $-\rho\overline{u'v'}$ is related to $\partial u/\partial y$ by using the so-called
mixing-length concept first proposed by Prandtl. According to that concept,
$-\rho\overline{u'v'}$ is to be calculated from

$$ -\rho\overline{u'v'} = \rho \ell^2 \left|\frac{\partial u}{\partial y}\right| \frac{\partial u}{\partial y}. \tag{37} $$

In the other assumption, $-\rho\overline{u'v'}$ is related to $\partial u/\partial y$ by using the so-called
"eddy-viscosity" concept, in which case $-\rho\overline{u'v'}$ is calculated from

$$ -\rho\overline{u'v'} = \rho \epsilon_m \frac{\partial u}{\partial y}. \tag{38} $$


The transport-equation methods consider the rate of change of the Reynolds
stresses and are more accurate general methods than the mean-velocity methods.
The equation for $-\rho\overline{u'v'}$ contains the mean velocity gradient, one or more of
the unknown normal stresses, and further unknown turbulence terms that can be
estimated empirically by using experimental data and theoretical ideas about
the behavior of the turbulence. Again, the unknown turbulence quantities are
found empirically. That approach obviously has the potential that, as the
behavior of turbulence is understood more, one can make more plausible
assumptions for the unknown turbulence quantities, leading to more accurate
prediction methods that are valid for a large range of flows.


Both assumptions for $-\rho\overline{u'v'}$, given either by (37) or by (38), can be
used in a mean-velocity method such as the one which involves the solution of
the equations given by (1) to (3). It is necessary to know the distribution of
$\ell$ and $\epsilon_m$ across the boundary layer. According to various studies (see refs.
19, 20) the turbulent boundary layer is regarded as a composite layer
consisting of inner and outer regions, and the distributions of $\ell$ and $\epsilon_m$
are described by two separate empirical expressions in each region. For
example, if the viscous sublayer close to the wall is excluded, $\ell$ is proportional
to $y$ in the inner region, and it is proportional to $\delta$ in the outer
region. Therefore,

$$ \ell = \kappa y, \qquad y_0 \le y \le y_c \tag{39a} $$

$$ \ell = \alpha_1 \delta, \qquad y_c < y < y_e \tag{39b} $$

Here $y_0$ is a small distance from the wall, a distance approximately equal to
... $(\tau_w/\rho)^{1/2}$, and $y_c$ is another distance obtained from the continuity of
mixing length. The empirical parameters $\kappa$ and $\alpha_1$ vary slightly according
to experimental data. For flows at high Reynolds numbers ($R_\theta > 5000$), they
are generally taken to be $\kappa = 0.40$ and $\alpha_1 = 0.075$. For further details.

Similarly, according to various studies, $\epsilon_m$ varies linearly with $y$ in
the inner region and is nearly constant in the outer region. Its variation
across the boundary layer can conveniently be described by the following
formulas:

with $\ell$ given by (39a). The parameter $\alpha$ is generally assumed to be a
universal constant equal to 0.0168 for $R_\theta > 5000$.

There have been numerous studies on extending (39) and (40) to include the
viscous sublayer. By an analogy with the laminar flow on an oscillating plate,
Van Driest suggested modifying (39a) to

$$ \ell = \kappa y\,[1 - \exp(-y/A)] \tag{41} $$

for a smooth flat-plate flow. Here $A$ is a damping-length constant for which
the best dimensionally correct empirical choice is about $A^+ \nu (\tau_w/\rho)^{-1/2}$, with $A^+$
denoting an empirical constant equal to 26.
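The mixing-length formulas above can be sketched numerically. The following is a minimal illustration, not the author's code: it assumes the Van Driest inner law (41) is simply capped by the outer value $\alpha_1 \delta$ of (39b) (taking the minimum is an assumed way of switching at $y_c$), and it assumes the mixing-length stress has the form $\rho \ell^2 |\partial u/\partial y| \, \partial u/\partial y$. The function and variable names are hypothetical.

```python
import math

KAPPA = 0.40     # inner-region constant quoted in the text
ALPHA_1 = 0.075  # outer-region constant quoted in the text
A_PLUS = 26.0    # Van Driest damping constant quoted in the text


def damping_length(nu, tau_w, rho):
    """Damping length A = A+ * nu * (tau_w / rho)**(-1/2)."""
    return A_PLUS * nu / math.sqrt(tau_w / rho)


def mixing_length(y, delta, nu, tau_w, rho):
    """Van Driest inner law (41), capped by the outer value alpha_1 * delta.

    Using min() to switch between the two regions is an assumption of
    this sketch; it enforces continuity of the mixing length at y_c.
    """
    A = damping_length(nu, tau_w, rho)
    inner = KAPPA * y * (1.0 - math.exp(-y / A))
    return min(inner, ALPHA_1 * delta)


def turbulent_shear_stress(rho, ell, dudy):
    """Mixing-length estimate rho * ell**2 * |du/dy| * du/dy (assumed form)."""
    return rho * ell**2 * abs(dudy) * dudy
```

With air-like values (nu = 1.5e-5, tau_w = 1.0, rho = 1.2, delta = 0.05, all hypothetical), the damping factor is negligible a few damping lengths from the wall, so the inner law approaches $\kappa y$, while far from the wall the outer value $\alpha_1 \delta$ takes over.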


Needless  to  say,  the  eddy  viscosity  and  mixing-length  formulas,  like  most 
(if  not  all)  expressions  for  turbulent  flows,  are  empirical.  Over  the  years, 
several  empirical  corrections  to  these  formulas  have  been  made,  to  account  for 
the  effects  of  low  Reynolds  number,  transitional  region,  compressibility,  mass 
transfer,  pressure  gradient,  and  transverse  curvature.  See  the  discussion  in 
ref.  19  for  complete  details. 


The transport-equation methods, in general, use two different approaches
to model the Reynolds shear stress $-\rho\overline{u'v'}$. These are Two-Variable Models,
and  Reynolds  Stress  Models.  Here  we  shall  discuss  only  the  more  commonly 
used  one,  namely  the  Two-Variable  Model.  Methods  based  on  two-variable  models 
use  the  turbulent  energy  equation,  which  for  two-dimensional  incompressible 
turbulent  flows  is 


(42) 


In general, the methods consist of two types. One type uses the value of
$q$ from (42) to form an eddy viscosity $\epsilon_m$. Since eddy viscosity is the
product of a velocity and a length,

$$ \epsilon_m \sim \text{velocity} \times \text{length}, $$

these methods define $\epsilon_m$ by

$$ \epsilon_m = c_1 q \ell. \tag{43} $$

Here $\ell$ is a turbulence length scale, and $c_1$ is a constant at high Reynolds
number of turbulence, $R_t = q\ell/\nu$. In the current calculation methods, $\ell$ is
specified algebraically or through a differential equation. When $\ell$ is
specified algebraically, the prediction method requires the solution of three
partial-differential equations, (2), (3), and (42), with suitable assumptions
for the unknown turbulence quantities appearing in (42). When $\ell$ is specified
by a differential equation, the prediction method requires the solution of four
partial-differential equations, (2), (3), (42), and a length-scale equation.
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A second  type  of  method,  advocated  by  Bradshaw  and  his  associates,  see, 
for example, refs. 21 and 22, uses the turbulent energy equation to form a
relation for the Reynolds shear stress. Bradshaw relates the Reynolds shear
stress to $q$ by

$$ -\overline{u'v'} = a_1 q^2. \tag{44} $$

Here $a_1$ is a universal constant assumed to be 0.15.


Both  types  require  closure  assumptions  for  the  unknown  turbulence  quantities 
appearing  in  (42).  However,  the  assumptions  for  those  quantities  do  not  differ 
from  one  type  to  another,  once  the  decision  of  how  to  utilize  the  turbulent 
kinetic  energy  equation  is  made.  For  example,  the  eddy-viscosity  methods  model 
the  production  term  as 

v2 
The Bradshaw-type methods model it as
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(46) 
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Both  approaches  model  the  dissipation  term  as 

$$ \Phi = c_2 \frac{q^3}{\ell}. $$

A practical but not necessary difference between the two approaches comes in
modeling the "diffusion" term. Eddy-viscosity methods model it by relating
it to the gradient of $q^2$ in the form


_ /£y!  + = c E 1. 

\ p 21  c3  m 3y 


(#) 


(48) 


where $c_3$ is a constant or a specified function. The Bradshaw-type methods
model it as


= 6 P q 


(49)

Here G is a constant or a specified function, and Q is a velocity scale
characteristic of the large eddy motions.


It  should  be  noted  that  the  closure  assumption  for  the  diffusion  term 
is  of  considerable  importance.  With  the  assumption  in  (48),  equation  (42) 
is  parabolic;  and  with  (49),  equation  (42)  is  hyperbolic.  Of  the  two-variable 
methods,  Bradshaw's  method  has  been  extensively  used  for  a wide  range  of  flow 
conditions  by  himself  and  by  others,  see  ref.  [20]. 

As  an  example  of  the  complex  problems  which  can  be  solved  using  the 
relatively  simple  mean  flow  models,  we  have  computed  the  entry  flow  in  a pipe, 
see Figure 14. Here there are two distinct regions of the flow, with an
Figure 14. Entry flow problem in a circular pipe. ® denotes the approximate
location where shear layers merge. @ denotes the region
where the flow is fully developed.


overlapping  zone  between  them.  Near  the  entrance  of  the  pipe  the  flow 
behaves like an external flow, $x < x_1$, and after some entry length, it
becomes fully developed, $x > x_2$, and has all the attributes of an internal
flow.  It  is  known  that  the  eddy-viscosity  formulations  in  both  regions  are 
totally  different. 

If  the  flow  development  is  computed  using  either  one  of  the  two  eddy 
viscosities  separately,  and  compared  with  experimental  data,  the  result  is 
as  shown  in  Figure  15a.  The  eddy  viscosity  formulation  used  for  external 


(a)  (b) 

Figure  15.  Comparison  of  calculated  and  experimental  centerline  velocity 
distributions  for  the  entry  flow  problem  using  three  different 
eddy  viscosity  formulations,  (a)  — • — denotes  results  obtained 
by  (40),  - ••  — denotes  results  obtained  by  using  a mixing-length 

distribution  for  fully  developed  flows,  (b)  denotes  results 

obtained  by  (50).  The  data  is  due  to  Dean. 

flows produces results that are in good agreement with experiment for $x < x_1$,
and the mixing-length formulation used for fully developed internal flows
(see ref. 1, for example) produces results that are in good agreement with
experiment for $x > x_2$. However, the overall agreement is poor. Our solution
was  to  model  the  entire  flow  situation  based  on  our  knowledge  of  the  physics. 

We  prescribe  a composite  eddy  viscosity  which  changes  smoothly  from  an  external 
($\epsilon_1$) to an internal ($\epsilon_2$) model as
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where  x0  is  the  length  down  the  pipe  where  the  two  "external"  boundary  layers 
merge,  and  the  denominator  giving  the  relaxation  scale  was  prescribed  by  divine 
intervention.  Using  this  model,  the  results  of  Figure  15b  were  produced. 

Hence,  the  simple  models  can  be  used  to  predict  some  complex  flows  by  using  a 
great  deal  of  empiricism.  For  details  see  ref.  23. 

There  are,  however,  some  problems  where  the  eddy-viscosity,  mixing-length 
formulations  may  not  work.  One  example  is  a turbulent  flow  near  the  trailing 
edge  of  bodies  (Figure  16)  where  the  multiple  structure  within  the  boundary 
layer  imposes  additional  scales  on  the  turbulence  field  which  may  not  be  handled 
by  simple  models.  Another  example  occurs  in  separating  and  reversed  flows.  In 
reverse-flow regions, the velocity profile (see Figure 1) must have $\partial u/\partial y = 0$
within  the  u < 0 zone.  According  to  simple  eddy-viscosity  models,  i.e. 
eq.  (40a),  the  turbulent  shear  stress  vanishes  at  this  point.  This  is  known 
to  be  an  unrealistic  result  and  so  turbulent  separating  flows  are  an  area  where 
both  the  mathematics  and  the  physics  of  the  problem  need  more  investigation. 


Figure  16.  Flow  near  the  trailing  edge  of  a body.  Due  to  the  discontinuity 
in  the  boundary  conditions,  the  flow  has  a double  structure  as  it 
leaves  the  trailing  edge  and  goes  into  the  wake.  I and  II 
refer to two separate regions in which the governing equations
are  expressed  in  different  scaling  variables. 
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APPROXIMATION  WITH  VPB  SPLINES 
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ABSTRACT.  This  article  makes  an  extension  of  the  interpolating  VP 
(variable  power)  splines  defined  in  [6,7]  to  define  approximating  VPB 
(variable  power  basic)  splines.  VPB  splines  are  of  a more  general  nature 
than  VP  splines  because  they  may  be  conveniently  used  for  linear 
approximation  in  addition  to  interpolation.  VPB  splines  will  also 
reduce to $C^2$ cubic B splines as a special case.

1. INTRODUCTION. The variable power basic (VPB) splines introduced
here are an outgrowth of the ideas contained in [6,7] with an
assist from the very readable book by Prenter [4]. The articles on
VP splines were written with the objective of providing and making use
of a relatively simple interpolant to which one could apply a certain
amount of local control without losing $C^2$ smoothness.

B splines have been used extensively for solving linear approximation
problems [1,2], but in order to make the best use of B splines
in least squares curve fitting, one must consider the nonlinear problem
of optimal knot placement [3,5]. This problem can involve a considerable
amount of computation if the number of knots is large. For this
reason,  we  will  consider  only  fixed  knots  here.  Although  the  knots  will 
be  fixed,  we  will  attempt  to  select  them  in  an  intuitively  reasonable 
manner  which,  in  addition  to  the  setting  of  the  nonlinear  parameters  of 
the  VPB  splines,  will  afford  us  fairly  appealing  data  fits. 

2.  SOME  BASIC  VP  SPLINE  FORMULAS.  The  tridiagonal  linear  system 
of  equations  producing  second  derivative  continuity  of  a VP  spline 
interpolant  is  given  by: 


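$$ (m_1 - 1)\,y'_1 + y'_2 = m_1 q_1 $$

$$ A_i y'_{i-1} + B_i y'_i + C_i y'_{i+1} = D_i q_{i-1} + E_i q_i, \qquad 1 < i < N $$

$$ y'_{N-1} + (n_{N-1} - 1)\,y'_N = n_{N-1} q_{N-1} $$

where

$$ A_i = m_{i-1}(m_{i-1} - 1)/(k_{i-1}\ell_{i-1}) $$

$$ C_i = n_i(n_i - 1)/(k_i \ell_i) $$

$$ B_i = (n_{i-1} - 1)A_i + (m_i - 1)C_i $$

$$ D_i = n_{i-1} A_i $$

$$ E_i = m_i C_i $$

and

$$ \ell_i = x_{i+1} - x_i $$

$$ q_i = (y_{i+1} - y_i)/\ell_i $$

$$ k_i = m_i + n_i - m_i n_i $$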
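The tridiagonal system above can be assembled directly. The sketch below is illustrative only: the function name `vp_spline_slopes` is hypothetical, the powers `m` and `n` are assumed to be stored per subinterval (length N-1), and a dense solver is used for brevity where a tridiagonal solver would normally be preferred.

```python
import numpy as np


def vp_spline_slopes(x, y, m, n):
    """Solve the VP spline system for the knot slopes y'_1, ..., y'_N.

    x, y : knot abscissas and ordinates (length N)
    m, n : power parameters per subinterval (length N-1), all > 2
    """
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    m = np.asarray(m, float)
    n = np.asarray(n, float)
    N = len(x)
    ell = np.diff(x)             # subinterval lengths
    q = np.diff(y) / ell         # divided differences
    k = m + n - m * n            # negative when m, n > 2

    M = np.zeros((N, N))
    rhs = np.zeros(N)
    # First end condition: (m_1 - 1) y'_1 + y'_2 = m_1 q_1
    M[0, 0], M[0, 1], rhs[0] = m[0] - 1.0, 1.0, m[0] * q[0]
    # Interior knots: second-derivative continuity
    for j in range(1, N - 1):
        A = m[j - 1] * (m[j - 1] - 1.0) / (k[j - 1] * ell[j - 1])
        C = n[j] * (n[j] - 1.0) / (k[j] * ell[j])
        B = (n[j - 1] - 1.0) * A + (m[j] - 1.0) * C
        D = n[j - 1] * A
        E = m[j] * C
        M[j, j - 1], M[j, j], M[j, j + 1] = A, B, C
        rhs[j] = D * q[j - 1] + E * q[j]
    # Last end condition: y'_{N-1} + (n_{N-1} - 1) y'_N = n_{N-1} q_{N-1}
    M[-1, -2], M[-1, -1], rhs[-1] = 1.0, n[-1] - 1.0, n[-1] * q[-1]
    return np.linalg.solve(M, rhs)
```

With all powers equal to 3 the equations reduce to those of the natural cubic spline, and for data lying on a straight line every slope equals the slope of that line whatever the powers are; both make convenient checks.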

The  nonlinear  parameter  vectors  m and  n may  be  set  as  in  [6,7] 
after  the  knots  have  been  selected  and  local  smoothing  has  been  applied 
to  produce  corresponding  y values.  A rule  for  setting  m and  n is  given 
by: 


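$$ R_i = n_i/m_{i-1} = (\ell_i/\ell_{i-1})^2\,(S_{i-1}/S_i) $$

$$ m_{i-1} = L \ \text{and} \ n_i = L R_i \quad \text{if } R_i > 1 $$

$$ n_i = L \ \text{and} \ m_{i-1} = L/R_i \quad \text{if } R_i < 1 $$

for $2 \le i \le N-1$, and

$$ n_1 = L \ \text{and} \ m_{N-1} = L $$

where $L$ ($>2$) is a lower bound on all $m_i$ and $n_i$, and $S_i$ is the chord length
on the $i$th knot subinterval.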
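The rule for the powers translates into a few lines of code. This is a hedged sketch under stated assumptions: the name `set_vp_powers` is hypothetical, the chord length is taken as the Euclidean distance between adjacent knot points, and the case $R_i = 1$ (not covered explicitly by the rule) is assumed to leave both powers at $L$.

```python
import numpy as np


def set_vp_powers(x, y, L=3.0):
    """Set the VP power vectors m and n (one entry per subinterval).

    At each interior knot the ratio n_i / m_{i-1} is forced to equal
    R_i = (ell_i / ell_{i-1})**2 * (S_{i-1} / S_i), with the smaller of
    the two powers held at the lower bound L.
    """
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    ell = np.diff(x)
    S = np.hypot(ell, np.diff(y))      # chord lengths (assumed Euclidean)
    m = np.full(len(ell), float(L))    # m[-1] stays at L
    n = np.full(len(ell), float(L))    # n[0] stays at L
    for j in range(1, len(x) - 1):     # interior knots, 0-based
        R = (ell[j] / ell[j - 1]) ** 2 * (S[j - 1] / S[j])
        if R > 1.0:
            n[j] = L * R               # m[j-1] remains L
        elif R < 1.0:
            m[j - 1] = L / R           # n[j] remains L
    return m, n
```

For equally spaced collinear knots every $R_i$ is 1, so all powers stay at $L$; with $L = 3$ this is the cubic case.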
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3.  KNOT  SELECTION.  A fairly  good  set  of  knots  may  be  selected 
using  the  following  process.  First,  the  data  should  be  given  some 
preliminary  local  smoothing  by  averaging  to  remove  high  frequency 
ringing.  A few  applications  of  the  following  formula  should  be 
adequate. 

$$ y_{i,s} = [\ell_i(y_{i-1} + y_i) + \ell_{i-1}(y_i + y_{i+1})]/[2(\ell_{i-1} + \ell_i)] $$

We  may  obtain  our  set  of  knots  by  selecting  a subset  of  the  data 
abscissas.  We  begin  our  set  with  the  first  and  last  abscissas,  defining 
the  first  subinterval.  Next,  we  examine  the  area  of  the  triangle 
formed  by  the  smoothed  y values  at  the  interval  endpoints  and  each 
point  within  the  interval,  taking  the  abscissa  corresponding  to  the 
largest  triangle  as  the  next  knot,  defining  two  more  subintervals  (and 
eliminating one). Each subinterval therefore has associated with it
a weight equal to this largest inscribed triangular area. Continuing
to  split  subintervals  with  the  largest  weight,  we  may  quickly  pick  out 
the  major  peaks  and  valleys  in  the  data.  We  may  stop  adding  knots 
after  a few  consecutive  additions  fail  to  violate  local  monotonicity, 
because  knots  near  the  major  peaks  and  valleys  will  then  have  been 
established.  The  following  diagrams  should  make  this  process  clear. 
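The knot selection process described above can be sketched as follows. This is an illustration under explicit assumptions, not the author's program: the smoothing formula is applied to interior points only (end points are left unchanged), and the stopping rule based on local monotonicity is replaced by a simpler one, namely a maximum knot count `max_knots` together with a zero-area tolerance. All function names are hypothetical.

```python
import numpy as np


def smooth_once(x, y):
    """One pass of local smoothing; end points are left unchanged (assumed)."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    ell = np.diff(x)
    left, right = ell[:-1], ell[1:]    # ell_{i-1} and ell_i at interior i
    ys = y.copy()
    ys[1:-1] = (right * (y[:-2] + y[1:-1]) + left * (y[1:-1] + y[2:])) / (
        2.0 * (left + right))
    return ys


def _best_split(x, y, a, b):
    """Largest inscribed triangle on data indices a..b: (area, index)."""
    if b - a < 2:
        return 0.0, None
    idx = np.arange(a + 1, b)
    area = 0.5 * np.abs((x[b] - x[a]) * (y[idx] - y[a])
                        - (x[idx] - x[a]) * (y[b] - y[a]))
    j = int(np.argmax(area))
    return float(area[j]), int(idx[j])


def select_knots(x, y, max_knots, tol=1e-12):
    """Greedy knot selection: always split the subinterval of largest weight."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    knots = [0, len(x) - 1]            # indices of first and last abscissas
    while len(knots) < max_knots:
        candidates = [_best_split(x, y, a, b)
                      for a, b in zip(knots[:-1], knots[1:])]
        weight, new = max(candidates, key=lambda c: c[0])
        if new is None or weight <= tol:
            break                      # nothing left worth splitting
        knots = sorted(knots + [new])
    return knots
```

The function returns indices into the data; the knot abscissas are `x[knots]`. Recomputing every candidate on each pass keeps the sketch short; a priority queue keyed on the weights would avoid the repeated work.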
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4.  DEFINITION  OF  VPB  SPLINES.  VPB  splines  are  a sequence  of 
"hill"  functions,  where  each  hill  function  is  associated  with  a 
particular  knot  (specified  x value).  Each  of  these  hill  functions 
has  finite  support,  which  means  that  each  is  non  zero  only  over  a 
finite  interval.  The  objective  of  this  section  will  be  to  construct  a 
hill  function  corresponding  to  the  i th  knot.  This  objective  will  be 
accomplished  by  constructing  a minimal  set  of  y values  (corresponding 
to  knots  near  the  i th  knot),  which,  when  interpolated  by  a VP  spline 
interpolant,  will  yield  a twice  differentiable  hill  function  of  finite 
support. 

We  might  start  by  trying  to  use  four  knots  to  define  our  hill 
function,  but  the  following  drawing  will  indicate  why  four  knots  are 
not  sufficient. 


A VP  spline,  just  as  a cubic  spline,  cannot  have  more  than  one 
inflection  point  per  subinterval.  For  this  reason,  points  A and  B 
must  have  positive  curvature.  Since  the  curvature  must  be  negative 
somewhere  in  the  middle  interval,  there  would  have  to  be  two  inflection 
points  (C,D)  in  this  interval.  This  contradiction  shows  that  four  knots 
are  insufficient.  The  addition  of  only  one  more  knot  will  be  sufficient 
to  define  a hill  function,  because  no  subinterval  will  need  to  have 
more  than  one  inflection  point. 
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The  following  diagram  will  aid  in  visualizing  the  definition  of 
the $i$th VPB spline $V_i(x)$.


Since $V_i(x)$ consists of four different adjacent functions, $V_{i,j}(x)$ will
refer to the function on the $j$th subinterval in particular. The circles
in the drawing indicate the unknown quantities to be constructed and
the boxes show the conditions with which these quantities will be
determined. The $i$th VPB spline may therefore be defined for
$3 \le i \le N - 2$ by:

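$$ V_{i,i-2}(x_{i-2}) = 0 = y_{i-2} $$

$$ V_{i,i}(x_i) = 1 = y_i $$

$$ V_{i,i+1}(x_{i+2}) = 0 = y_{i+2} $$

$$ V'_{i,i-2}(x_{i-2}) = 0 = y'_{i-2} $$

$$ V'_{i,i+1}(x_{i+2}) = 0 = y'_{i+2} $$

$$ V''_{i,i-2}(x_{i-2}) = 0 $$

$$ V''_{i,i+1}(x_{i+2}) = 0 $$

$$ V''_{i,i-2}(x_{i-1}) = V''_{i,i-1}(x_{i-1}) $$

$$ V''_{i,i-1}(x_i) = V''_{i,i}(x_i) $$

$$ V''_{i,i}(x_{i+1}) = V''_{i,i+1}(x_{i+1}) $$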


It is understood that $y$ and $y'$ are associated with $V_i(x)$ only; the
subscript  i has  been  dropped  from  these  quantities  for  convenience. 


The  first  five  of  these  conditions  are  trivially  satisfied;  the 
sixth  and  seventh  reduce  to: 


$$ y_{i-1} = (\ell_{i-2}/m_{i-2})\,y'_{i-1} $$

$$ y_{i+1} = -(\ell_{i+1}/n_{i+1})\,y'_{i+1} $$

The  last  three  equations  may  therefore  be  written  as: 

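$$ Q_{11} y'_{i-1} + Q_{12} y'_i = Q_{14} $$

$$ Q_{21} y'_{i-1} + Q_{22} y'_i + Q_{23} y'_{i+1} = Q_{24} $$

$$ Q_{32} y'_i + Q_{33} y'_{i+1} = Q_{34} $$

where

$$ Q_{11} = B_{i-1} + (E_{i-1}\ell_{i-2}/\ell_{i-1} - D_{i-1})/m_{i-2} $$

$$ Q_{12} = C_{i-1} $$

$$ Q_{14} = E_{i-1}/\ell_{i-1} $$

$$ Q_{21} = A_i + D_i \ell_{i-2}/(m_{i-2}\ell_{i-1}) $$

$$ Q_{22} = B_i $$

$$ Q_{23} = C_i + E_i \ell_{i+1}/(n_{i+1}\ell_i) $$

$$ Q_{24} = D_i/\ell_{i-1} - E_i/\ell_i $$

$$ Q_{32} = A_{i+1} $$

$$ Q_{33} = B_{i+1} + (D_{i+1}\ell_{i+1}/\ell_i - E_{i+1})/n_{i+1} $$

$$ Q_{34} = -D_{i+1}/\ell_i $$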


and  the  A's,  B's,  C's  and  D's  are  defined  as  in  section  2. 

It  is  not  immediately  obvious  that  the  determinant  of  this  system 
cannot  be  zero.  An  outline  of  a proof  of  this  fact  will  therefore  be 
given  here.  First,  the  B's,  D's  and  E's  may  be  expressed  in  terms  of 
A's  and  C's.  The  Q's  of  the  determinant  will  therefore  be  of  the  form 

Q = aA  + bC 

where  a and  b are  not  both  zero.  It  is  easily  shown  that  if  either  a 

or  b is  not  zero,  then  it  must  be  positive,  provided  all  m's  and  n's 

are  >2.  We  have,  therefore,  that  all  Q's  of  the  determinant  are 

negative,  since  the  A's  and  C's  are  all  negative.  The  determinant 
of  the  system  is  given  by: 

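$$ \Delta = Q_{11}Q_{22}Q_{33} - Q_{11}Q_{23}Q_{32} - Q_{12}Q_{21}Q_{33} $$

or

$$ \Delta/(Q_{11}Q_{22}Q_{33}) = 1 - (Q_{11}Q_{23}Q_{32} + Q_{12}Q_{21}Q_{33})/(Q_{11}Q_{22}Q_{33}). $$

In order to show that $\Delta$ is negative, it is sufficient to show that

$$ q = (Q_{11}Q_{23}Q_{32} + Q_{12}Q_{21}Q_{33})/(Q_{11}Q_{22}Q_{33}) < 1. $$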

In  the  process  of  computing  the  products  in  the  numerator  and  denominator 
of  q,  it  will  be  found  that  all  terms  of  the  numerator  will  be  of  the 
form  aACA  or  bCAC,  where  a and  b are  > 0.  The  denominator  will  also 
contain  an  AAA  term  and  a CCC  term.  It  is  important  to  note  that  all 
terms  of  both  the  numerator  and  the  denominator  are  negative.  Now,  for 
every term in the numerator, there is a corresponding term in the denominator
whose coefficient is strictly greater than the numerator coefficient.
This  fact  and  the  fact  that  all  coefficients  are  positive  implies  that 
this  quotient  is  always  less  than  one  and  therefore  that  the  determinant 
is  always  negative. 
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For  one  coefficient  pair,  we  have  the  following  example:  The 
coefficient of $A_{i+1} C_i A_{i-1}$ in the numerator of $q$ is:

$$ (1 + m_i\ell_{i+1}/(n_{i+1}\ell_i))\,(n_{i-2} - 1 - n_{i-2}/m_{i-2}) $$

The coefficient of $A_{i+1} C_i A_{i-1}$ in the denominator of $q$ is:

$$ (n_i - 1 + n_i\ell_{i+1}/(n_{i+1}\ell_i))\,(m_i - 1)\,(n_{i-2} - 1 - n_{i-2}/m_{i-2}) $$

The quotient of the numerator and the denominator coefficients reduces
to:

$$ (n_{i+1}\ell_i + m_i\ell_{i+1}) / (n_{i+1}\ell_i(m_i - 1)(n_i - 1) + n_i(m_i - 1)\ell_{i+1}) $$

In this fraction, the left and right terms of the numerator are less
than the left and right terms of the denominator, respectively, since
$m_i + n_i - m_i n_i < 0$.
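The 3 by 3 system for the hill function and the sign of its determinant can be checked numerically. The sketch below is an assumption-laden illustration: the entries of Q are obtained by substituting the end conditions $y_{i-1} = (\ell_{i-2}/m_{i-2})\,y'_{i-1}$ and $y_{i+1} = -(\ell_{i+1}/n_{i+1})\,y'_{i+1}$ into the second-derivative continuity equations of section 2, and the function name `vpb_hill_system` and the local ordering of the four subintervals are hypothetical choices.

```python
import numpy as np


def vpb_hill_system(ell, m, n):
    """Q matrix and right-hand side for the slopes (y'_{i-1}, y'_i, y'_{i+1}).

    ell, m, n hold the values on the four subintervals i-2, i-1, i, i+1
    (in that order) supporting the interior VPB spline V_i.
    """
    ell = np.asarray(ell, float)
    m = np.asarray(m, float)
    n = np.asarray(n, float)
    k = m + n - m * n

    def coeffs(left, right):
        # Section 2 coefficients at the knot between two local subintervals
        A = m[left] * (m[left] - 1) / (k[left] * ell[left])
        C = n[right] * (n[right] - 1) / (k[right] * ell[right])
        B = (n[left] - 1) * A + (m[right] - 1) * C
        return A, B, C, n[left] * A, m[right] * C

    A1, B1, C1, D1, E1 = coeffs(0, 1)   # knot i-1
    A2, B2, C2, D2, E2 = coeffs(1, 2)   # knot i
    A3, B3, C3, D3, E3 = coeffs(2, 3)   # knot i+1
    Q = np.array([
        [B1 + (E1 * ell[0] / ell[1] - D1) / m[0], C1, 0.0],
        [A2 + D2 * ell[0] / (m[0] * ell[1]), B2,
         C2 + E2 * ell[3] / (n[3] * ell[2])],
        [0.0, A3, B3 + (D3 * ell[3] / ell[2] - E3) / n[3]],
    ])
    rhs = np.array([E1 / ell[1], D2 / ell[1] - E2 / ell[2], -D3 / ell[2]])
    return Q, rhs
```

For unit spacing and all powers equal to 3, solving the system gives slopes 3/4, 0 and -3/4, hence $y_{i-1} = y_{i+1} = 1/4$. These are the values of a cubic B spline scaled to unit height, consistent with the statement that VPB splines reduce to $C^2$ cubic B splines as a special case.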

The first and second VPB splines may be defined through the following
diagrams and conditions.


Conditions defining $V_1(x)$:

$$ V_{1,1}(x_1) = 1 = y_1 $$

$$ V_{1,2}(x_3) = 0 = y_3 $$

$$ V'_{1,2}(x_3) = 0 = y'_3 $$

$$ V''_{1,1}(x_1) = 0 $$

$$ V''_{1,2}(x_3) = 0 $$

$$ V''_{1,1}(x_2) = V''_{1,2}(x_2) $$

The conditions defining $V_2(x)$ are:

$$ V_{2,1}(x_1) = 0 = y_1 $$

$$ V_{2,2}(x_2) = 1 = y_2 $$

$$ V_{2,3}(x_4) = 0 = y_4 $$

$$ V'_{2,3}(x_4) = 0 = y'_4 $$

$$ V''_{2,1}(x_1) = 0 $$

$$ V''_{2,3}(x_4) = 0 $$

$$ V''_{2,1}(x_2) = V''_{2,2}(x_2) $$

$$ V''_{2,2}(x_3) = V''_{2,3}(x_3) $$

The  last  and  second  last  VPB  splines  are  defined  in  virtually  the 
same manner as $V_1(x)$ and $V_2(x)$ respectively.

5. VPB APPROXIMATION. We may approximate univariate data
$(x_k, y_k)$, $1 \le k \le N$, by a linear combination of VPB splines defined over
a knot sequence $u_j$, $1 \le j \le n$, where usually $n \ll N$. The approximation
is therefore of the form:

$$ A(x) = \sum_{j=1}^{n} a_j V_j(x) $$

The usual least squares normal equations are therefore:

$$ \sum_{j=1}^{n} a_j \sum_{k=1}^{N} V_i(x_k) V_j(x_k) = \sum_{k=1}^{N} y_k V_i(x_k), \qquad 1 \le i \le n $$

The matrix of this system has a bandwidth of 7 since:

$$ \sum_{k=1}^{N} V_i(x_k) V_j(x_k) = 0 \quad \text{if } |i-j| > 3 $$
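The band structure of the normal equations can be seen in a small experiment. As a stand-in for the VPB hill functions, the sketch below uses the cardinal cubic B spline on uniformly spaced knots, which has the same four-subinterval support; this substitution, the uniform spacing, and the names `cubic_hill` and `normal_equations` are assumptions made for illustration only.

```python
import numpy as np


def cubic_hill(s):
    """Cardinal cubic B spline in s = (x - u_i)/h; zero for |s| >= 2."""
    s = np.abs(np.asarray(s, float))
    out = np.zeros_like(s)
    inner = s < 1.0
    outer = (s >= 1.0) & (s < 2.0)
    out[inner] = 2.0 / 3.0 - s[inner] ** 2 + 0.5 * s[inner] ** 3
    out[outer] = (2.0 - s[outer]) ** 3 / 6.0
    return out


def normal_equations(x, y, knots):
    """Return the normal matrix and right-hand side for the hill-function fit."""
    h = knots[1] - knots[0]            # uniform knot spacing assumed
    V = np.column_stack([cubic_hill((x - u) / h) for u in knots])
    return V.T @ V, V.T @ y
```

Entries of the normal matrix with $|i-j| > 3$ vanish because the supports of the two hill functions no longer overlap, so a banded solver with seven diagonals suffices; a dense solve is adequate for a small check.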
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The  following  drawings  compare  the  behavior  of  VPB  and  cubic  B 
splines in fitting some noisy data. The knots used in each comparison
are marked on the x axis. The VPB splines obviously give a better fit,
but it should be noted that the nonlinear parameter vectors m and n of
the VPB splines give them additional degrees of freedom with which to
accomplish this better fit. One could therefore say that these comparisons
are unfair, if taken at face value. It is also true, however, that
m and n are not established by the least squares process, but rather by
local considerations only. It is therefore difficult, if not impossible,
to compare these two methods mathematically. For purposes of practice,
however, it seems that the most difficult part of the spline approximation
problem is in the knot selection, and VPB splines seem to do quite
well with relatively few suboptimal knots.
[Figure panels: NOISY FUNCTION AND CUBIC B SPLINE APPROXIMANT; NOISY FUNCTION AND VPB SPLINE APPROXIMANT; VPB VS. CUBIC B RMS ERROR]
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BEST  L2  APPROXIMATION  FROM  NONLINEAR  SPLINE 

MANIFOLDS  I - UNICITY  RESULTS* 
ABSTRACT. Two unicity results concerning best $L_2$ approximation from
(nonlinear) manifolds of $C^0$ piecewise polynomials with variable break
points are given. An algorithm for obtaining the best approximants is
also included.


1. INTRODUCTION. We will denote a knot sequence on [0,1] by
$t = (t_0, \ldots, t_{N+1})$, $0 = t_0 < t_1 < \cdots < t_{N+1} = 1$. Let $P^{2k}$ be the class
of all polynomials with order $2k$ (or degree $\le 2k-1$) and set

$$ S_N^{2k} = \{ s \in C[0,1] : \text{there is a } t \text{ so that } s \in P^{2k} \text{ on } (t_i, t_{i+1}) \text{ for } i = 0, \ldots, N \}. $$

Clearly, $S_N^{2k}$ is a nonlinear manifold and could be represented by linear combinations
of B-splines with each knot having multiplicity $2k-1$.

Although $S_N^{2k}$ is not closed in $L_2[0,1]$, one can show using the arguments
of [1,4] that if $f$ is continuous then $f$ has a best $L_2[0,1]$ approximant
from $S_N^{2k}$. Thus, in this article we concentrate only on the questions of
uniqueness and eventual uniqueness. In particular, in section 2 we will
sketch the proofs of the following theorems.

Theorem 1. Let $f \in C^{2k}[0,1]$ with $f^{(2k)} > 0$ on [0,1]. Suppose that $f^{(2k)}$
is logarithmic concave on (0,1). Then for every positive integer $N$, $f$
has a unique best $L_2[0,1]$ approximant from $S_N^{2k}$.


The  logarithmic  concavity  condition  is  not  necessary  when  we  have 
"many"  knots. 

Theorem 2. Let $f \in C^{(2k+3)}[0,1]$ with $f^{(2k)} > 0$ on [0,1]. Then there
exists an $N_0$ such that for each $N > N_0$, $f$ has a unique best $L_2[0,1]$
approximant from $S_N^{2k}$.


It should be emphasized that $N_0$ depends on $f$. This is the idea of
eventual uniqueness. We will discuss in section 3 a method of computing
the best $L_2[0,1]$ approximants and in section 4 some numerical examples
will be presented.


2.  SKETCH  OF  PROOFS.  In  this  section  we  will  sketch  the  proofs  of 
Theorems  1 and  2.  As  will  be  evident  the  main  tool  is  topological  degree 
theory. 

Let $\Sigma^N \subset R^N$ be the open simplex $\Sigma^N = \{(t_1, \ldots, t_N) : 0 < t_1 < \cdots < t_N < 1\}$.
The $t^N \in \Sigma^N$ will denote the interior knot sequence of the various splines.
We are interested in studying the uniqueness of an $s \in S_N^{2k}$ so that

$$ \inf \left\{ \int_0^1 |g - f|^2 : g \in S_N^{2k} \right\} = \int_0^1 |s - f|^2 . $$

Such an $s$ will be called a best $L_2[0,1]$ approximant to $f$ from $S_N^{2k}$.

The  following  two  lemmas  will  be  useful  in  formulating  the  proofs 
of  Theorems  1 and  2. 

Lemma 1. Let $f \in C^{(2k)}[0,1]$, $f^{(2k)} > 0$, and $s \in S_N^{2k}$ be a best $L_2[0,1]$
approximant to $f$ from $S_N^{2k}$. Let $s$ have interior knots at $\{t_i\}_{i=1}^{N}$. Then
the restriction of $s$ to any of the intervals $(t_i, t_{i+1})$, $i = 0, \ldots, N$, is a
best $L_2$ approximation to $f$ on the interval $(t_i, t_{i+1})$ from $P^{2k}$. That is

$$ \int_{t_i}^{t_{i+1}} [(f-s)(t)]\, t^j \, dt = 0, \qquad j = 0, 1, \ldots, 2k-1; \quad i = 0, \ldots, N. $$


This lemma is easily established by differentiating the error with
respect to the knots. A minor but nagging point in the proof of this
lemma requires one to show that if $s$ is a best $L_2[0,1]$ approximant to $f$
from $S_N^{2k}$ then $s^{(2k-1)}(t_i+) \ne s^{(2k-1)}(t_i-)$. This result can be obtained via
the zero counting techniques found in [5].

The second lemma tells us that all the knots of the best approximation
are active.

Lemma 2. Let $f$ be a continuous function on [0,1] with $f \notin S_N^{2k}$ and let
$s \in S_N^{2k}$ be a best $L_2[0,1]$ approximation to $f$ from $S_N^{2k}$. Then $s \notin S_{N-1}^{2k}$.

Using Lemmas 1 and 2 we can formulate a necessary condition for the
knots, $t^N$, to be the knots of a best $L_2[0,1]$ approximant to $f$ from $S_N^{2k}$,
provided $f^{(2k)} > 0$. In particular, let $\ell_i$ be the best $L_2$ approximant
to $f$ on the interval $(t_i, t_{i+1})$ from $P^{2k}$, $i = 0, \ldots, N$. Then a necessary
condition for the $t_i$'s to be the knots of a best $L_2[0,1]$ approximant to $f$
is that $\ell_i(t_{i+1}) = \ell_{i+1}(t_{i+1})$ for $i = 0, \ldots, N-1$. That is, we would like to
show that the equation $F(t^N) = 0$, where the map

$$ F : \Sigma^N \to R^N $$

is defined by

$$ F(t^N) = (\ell_1(t_1) - \ell_0(t_1), \ldots, \ell_N(t_N) - \ell_{N-1}(t_N)), $$

has precisely one solution.

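Letting $\Delta t_i = t_{i+1} - t_i$ we see that

$$F_i(t^N) = -\frac{(\Delta t_i)^{2k}}{(2k-1)!}\int_0^1 \tau^{2k-1}(1-\tau)^{2k} f^{(2k)}(\tau\Delta t_i + t_i)\,d\tau + \frac{(\Delta t_{i-1})^{2k}}{(2k-1)!}\int_0^1 \tau^{2k}(1-\tau)^{2k-1} f^{(2k)}(\tau\Delta t_{i-1} + t_{i-1})\,d\tau.$$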

Using the arguments in [1] one can now see that if $f^{(2k)} > 0$ and $\log f^{(2k)}$ is concave then the degree of F with respect to $\Sigma^N$ and the point (0,...,0) is 1 and that this counts the number of solutions. Similarly, Theorem 2 is proved by showing that the degree is 1 and that eventually (i.e. for large N) this is the number of solutions. Details and related results can be found in [3].

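3. NUMERICAL ALGORITHM. Let $t^N \in \Sigma^N$ and k be a positive even integer. For i=1,...,N, set

$$F_i = F_i(t^N, f) = \frac{(\Delta t_{i-1})^k}{(k-1)!}\int_0^1 \tau^k(1-\tau)^{k-1} f^{(k)}(\tau\Delta t_{i-1} + t_{i-1})\,d\tau - \frac{(\Delta t_i)^k}{(k-1)!}\int_0^1 \tau^{k-1}(1-\tau)^k f^{(k)}(\tau\Delta t_i + t_i)\,d\tau$$

where $\Delta t_j = t_{j+1} - t_j$. The Jacobian matrix $[\partial F_i/\partial t_j]$ of $(F_1,\dots,F_N)$ is tridiagonal and its nonzero entries can be written as follows: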

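$$\frac{\partial F_i}{\partial t_{i-1}} = -\frac{k(\Delta t_{i-1})^{k-1}}{(k-1)!}\int_0^1 \tau^{k-1}(1-\tau)^k f^{(k)}(\tau\Delta t_{i-1} + t_{i-1})\,d\tau, \quad 2 \le i \le N,$$

$$\frac{\partial F_i}{\partial t_i} = \frac{(\Delta t_{i-1})^{k-1}}{(k-1)!}\int_0^1 \tau^k(1-\tau)^{k-2}(k\tau-1) f^{(k)}(\tau\Delta t_{i-1} + t_{i-1})\,d\tau + \frac{(\Delta t_i)^{k-1}}{(k-1)!}\int_0^1 \tau^{k-2}(1-\tau)^k(k(1-\tau)-1) f^{(k)}(\tau\Delta t_i + t_i)\,d\tau, \quad 1 \le i \le N,$$

$$\frac{\partial F_i}{\partial t_{i+1}} = -\frac{k(\Delta t_i)^{k-1}}{(k-1)!}\int_0^1 \tau^k(1-\tau)^{k-1} f^{(k)}(\tau\Delta t_i + t_i)\,d\tau, \quad 1 \le i \le N-1.$$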


Once k, $t^N$ and $f^{(k)}$ are known, the numerical iterations by Newton's method can be easily calculated to get a solution of $F_i = 0$, i=1,...,N. At each iteration, we use a Gaussian quadrature formula to approximate the integrals which appear in the Jacobian matrix $[\partial F_i/\partial t_j]$ and in $(F_1,\dots,F_N)$.

In the following we will describe a numerical algorithm to locate a zero of $(F_1,\dots,F_N)$.

Step 1. Initialize $t^N$.

Step 2. Check $t^N$ to be sure it is in sequence.

Step 3. Check the number of iterations and the size of relative error between two consecutive iterations.

Step 4. Build $(F_1,\dots,F_N)$ and the tridiagonals of $[\partial F_i/\partial t_j]$ via Gaussian quadrature formula.

Step 5. Solve Ax = b where $[a_{ij}] = [\partial F_i/\partial t_j]$ and $b_i = F_i$.

Step 6. Set $t_i = t_i - x_i$, where $x_i$ is obtained from Step 5. Then go to Step 2.

Remarks. For lack of information in Step 1, one usually starts with equally spaced $t^N$. In order to compare with Burchard and Hale's asymptotic constant [2], as will be done in the next section, we use their balanced mesh as initial knots. Step 2 seems redundant. However, it serves as a safeguard against any undesired mesh.
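Steps 1 to 6 can be sketched in Python for the special case k = 2. This is only an illustrative sketch under stated assumptions, not the authors' program: the names `integral` and `newton_knots` are hypothetical, a dense linear solve stands in for a tridiagonal solver, and a 4-point Gauss-Legendre rule is assumed for every integral. The argument `f2` is the second derivative f'' and must accept NumPy arrays.

```python
import numpy as np

# 4-point Gauss-Legendre rule mapped from [-1, 1] to [0, 1] (assumed rule).
_X, _W = np.polynomial.legendre.leggauss(4)
TAU, WTS = 0.5 * (_X + 1.0), 0.5 * _W

def integral(weight, f2, a, h):
    """Approximate the integral over [0, 1] of weight(tau) * f2(tau*h + a)."""
    return float(np.sum(WTS * weight(TAU) * f2(TAU * h + a)))

def newton_knots(f2, t_init, tol=1e-10, max_iter=50):
    """Newton iteration for F(t) = 0 with k = 2 (hypothetical helper)."""
    t = np.array(t_init, dtype=float)                  # Step 1
    n = t.size
    for _ in range(max_iter):                          # Step 3: iteration count
        full = np.concatenate(([0.0], t, [1.0]))       # t_0 = 0, t_{N+1} = 1
        h = np.diff(full)                              # mesh widths Delta t_i
        if np.any(h <= 0.0):                           # Step 2
            raise ValueError("knots are out of sequence")
        F = np.empty(n)
        J = np.zeros((n, n))
        for i in range(1, n + 1):                      # Step 4
            hl, hr, al, ar = h[i - 1], h[i], full[i - 1], full[i]
            F[i - 1] = (hl**2 * integral(lambda s: s**2 * (1 - s), f2, al, hl)
                        - hr**2 * integral(lambda s: s * (1 - s)**2, f2, ar, hr))
            J[i - 1, i - 1] = (
                hl * integral(lambda s: s**2 * (2 * s - 1), f2, al, hl)
                + hr * integral(lambda s: (1 - s)**2 * (1 - 2 * s), f2, ar, hr))
            if i > 1:
                J[i - 1, i - 2] = -2 * hl * integral(
                    lambda s: s * (1 - s)**2, f2, al, hl)
            if i < n:
                J[i - 1, i] = -2 * hr * integral(
                    lambda s: s**2 * (1 - s), f2, ar, hr)
        x = np.linalg.solve(J, F)                      # Step 5
        t = t - x                                      # Step 6
        if np.max(np.abs(x)) < tol * np.max(h):        # Step 3: increment size
            return t
    raise RuntimeError("Newton iteration did not converge")

# One interior knot for f''(x) = x; F_1 = 0 reduces to 4t^2 + t - 2 = 0.
print(newton_knots(lambda x: x, [0.5]))
```

For f'' identically 1 the zero of F is the equally spaced mesh, which gives a second quick check of the sketch.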


4. NUMERICAL EXAMPLES. Since the approximation is taken in the $L_2$-norm, for given $f^{(k)}$, k=2, one can easily calculate Burchard and Hale's asymptotic constant

$$B_{2,k}(f) = C_{2,k}\,\|f''\|_\sigma = \frac{\|f''\|_\sigma}{12\sqrt{5}}$$

where $\sigma = (k+2^{-1})^{-1} = 2/5$ and $\|f''\|_\sigma = \big(\int_0^1 |f''|^\sigma\big)^{1/\sigma}$. A similar numerical scheme as above will work for k=4,6,...

In both examples below, we are going to consider only splines of order 2, that is, from $S_N^2$. Let $f \in C^2[0,1]$. For N=1,...,47, we try to calculate

$$(N+1)^2 \inf_{s \in S_N^2} \|f-s\|_2$$

and note that it "converges" to $B_{2,2}(f)$ as N increases.

A mesh $t^N \in \Sigma^N$ is called balanced with respect to f if

$$\int_{t_{i-1}}^{t_i} |f^{(k)}|^\sigma = \frac{1}{N+1}\int_0^1 |f^{(k)}|^\sigma, \quad i=1,\dots,N+1,$$

where $\sigma = (k+2^{-1})^{-1}$ [2]. We take such $t^N$ as our initial guess for the above algorithm. Let $T = \max_{1 \le i \le N+1}(t_i - t_{i-1})$ and let $x_i$ be the increment in $t_i$ by Newton's method. Further iterations need not be carried out if

$$\max_{1 \le i \le N} |x_i| < 10^{-10}\,T.$$

Actually, it takes less than 6 iterations in all cases to satisfy the above terminating criterion.

At the end we will compare $(N+1)^2\|f-s\|_2$ among three different piecewise polynomials $s_j$, namely, (i) $s_1 = s_1(t^N) \in S_N^2$ such that $\|f-s_1\|_2 \le \|f-g\|_2$ for all $g \in S_N^2$, (ii) $s_2 = s_2(t^N) \in S_N^2$ such that $t^N$ is balanced, and (iii) a piecewise linear $s_3 = s_3(t^N)$ such that $t^N$ is balanced and $s_3$ is allowed to be discontinuous at each of the knots. In each case best linear approximants are computed.

We consider the following two examples:

(a) $f(x) = x^3/6$ and $f''(x) = x$ on [0,1];

(b) $f(x) = -4x^{1/2}$ and $f''(x) = x^{-3/2}$ on (0,1].

Note that in example (a) $\|f-s\|_2$ and the integrals which form the entries of $(F_1,\dots,F_N)$ and $[\partial F_i/\partial t_j]$ can be accurately computed by using the Gaussian quadrature formula with 4 Gaussian points. The reason is simply that the integrand in each integral is a polynomial of degree $\le 6$.

However, in example (b) f'' is not only not a polynomial but also unbounded on (0,1). Fortunately, the integral involved with f'' over $(t_0, t_1)$ is under control, because f'' has to be multiplied by a positive weight function with a double zero at $t_0$. To gain more accurate approximation of each integral in example (b), we subdivide the interval in question into four subintervals and then apply the Gaussian quadrature formula on each subinterval.
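The effect of subdividing before applying the 4-point Gaussian formula can be illustrated with a small Python sketch. It is an illustration under assumptions, not the authors' code: the helper `gauss_composite` is hypothetical, and the model integrand takes $t_0 = 0$ and $\Delta t_0 = 1$, so that the weight $\tau^2(1-\tau)$ multiplies $f''(\tau) = \tau^{-3/2}$ and the exact value of the integral is 4/15.

```python
import numpy as np

def gauss_composite(g, a, b, panels=1, points=4):
    """Apply a Gauss-Legendre rule on each of `panels` equal subintervals of [a, b]."""
    x, w = np.polynomial.legendre.leggauss(points)
    edges = np.linspace(a, b, panels + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        half, mid = 0.5 * (hi - lo), 0.5 * (hi + lo)
        total += half * np.sum(w * g(mid + half * x))
    return float(total)

# Weight with a double zero at tau = 0 times the unbounded f''(tau) = tau^(-3/2).
def integrand(tau):
    return tau**2 * (1.0 - tau) * tau**-1.5

exact = 4.0 / 15.0   # integral of tau^(1/2) * (1 - tau) over [0, 1]
err_one = abs(gauss_composite(integrand, 0.0, 1.0, panels=1) - exact)
err_four = abs(gauss_composite(integrand, 0.0, 1.0, panels=4) - exact)
print(err_one, err_four)
```

The Gauss nodes lie strictly inside each subinterval, so the integrand is never evaluated at the singular point itself.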

Before tabulating our results, we calculate Burchard and Hale's asymptotic constant for each example as follows:

$$B_{2,2}(x^3/6) = \frac{1}{12\sqrt{5}}\,(5/7)^{5/2} = .016069918,$$

$$B_{2,2}(-4x^{1/2}) = \frac{1}{12\sqrt{5}}\,(5/2)^{5/2} = .368284782.$$
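As a quick arithmetic check of the two constants, the following Python sketch evaluates $\|f''\|_\sigma / (12\sqrt{5})$ in closed form for $f''(x) = x^p$ on (0,1), assuming $\sigma = 2/5$ as stated in this section. The function name `burchard_hale_constant` is hypothetical and is introduced only for this illustration.

```python
import math

def burchard_hale_constant(p, sigma=0.4):
    """Constant for f''(x) = x**p on (0, 1); requires p * sigma > -1."""
    # Integral of x**(p*sigma) over (0, 1) in closed form.
    integral = 1.0 / (p * sigma + 1.0)
    return integral ** (1.0 / sigma) / (12.0 * math.sqrt(5.0))

b_a = burchard_hale_constant(1.0)    # example (a): f'' = x
b_b = burchard_hale_constant(-1.5)   # example (b): f'' = x^(-3/2)
print(b_a, b_b)
```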

Table 1 below corresponds to example (a) and Table 2 is for example (b). In both tables, $s_1$, $s_2$ and $s_3$ represent three constants $(N+1)^2\|f-s_j\|_2$ among three different piecewise polynomials mentioned above.


Table 1

N+1      s_1          s_2          s_3
  2   .01746675    .01765997    .01759154
  4   .01676272    .01689968    .01686163
  6   .01653038    .01661081    .01660505
  8   .01641471    .01649137    .01647049
 10   .01614548    .01641029    .01639462
 12   .01629940    .01635437    .01634127
 14   .01626652    .01631423    .01630298
 16   .01624188    .01628402    .01627416
 18   .01622273    .01626046    .01625168
 20   .01620742    .01624157    .01623366
 22   .01619490    .01622609    .01621889
 24   .01618446    .01621317    .01620656
 26   .01617564    .01620222    .01619612
 28   .01616807    .01619283    .01618716
 30   .01616152    .01618469    .01617939
 32   .01615579    .01617755    .01617259
 34   .01615073    .01617125    .01616658
 36   .01614624    .01616565    .01616123
 38   .01614222    .01616063    .01615645
 40   .01611860    .01615612    .01615214
 42   .01611512    .01615203    .01614824
 44   .01611235    .01614831    .01614470
 46   .01612963    .01614492    .01614146
 48   .01612714    .01614180    .01613849
Table 2

N+1      s_1          s_2          s_3
  2   .24631921    .28220634    .27461178
  4   .29629390    .32423129    .32094308
  6   .31722593    .33869744    .33660287
  8   .32872871    .34601495    .34447856
 10   .33600192    .35043162    .34921859
 12   .34101598    .35338673    .35238464
 14   .34468195    .35550266    .35464902
 16   .34747905    .35709237    .35634888
 18   .34968345    .35833044    .35767192
 20   .35146549    .35932191    .35873094
 22   .35293594    .36013378    .35959779
 24   .35416996    .36081080    .36032042
 26   .35522031    .36138398    .36093207
 28   .35612517    .36187551    .36145647
 30   .35691280    .36230168    .36191105
 32   .35760461    .36267471    .36230889
 34   .35821707    .36300396    .36265997
 36   .35876310    .36329670    .36297210
 38   .35925295    .36355870    .36325141
 40   .35969486    .36379454    .36350281
 42   .36009554    .36400797    .36373030
 44   .36046051    .36420203    .36393712
 46   .36079434    .36437924    .36412598
 48   .36110085    .36454171    .36429911
BEST L2 APPROXIMATION FROM NONLINEAR
SPLINE MANIFOLDS II - APPLICATION TO
OPTIMAL QUADRATURE FORMULA*

Charles  K.  Chui,  Philip  W.  Smith  and  Joseph  D.  Ward 
Department  of  Mathematics 
Texas  A&M  University 
College  Station,  Texas  77843 


ABSTRACT.  It  is  shown  that  for  certain  positive  weight  functions  and 
certain  boundary  conditions  the  corresponding  optimal  quadrature  formula 
is  unique. 

1. INTRODUCTION. In this paper we use the notation as well as the results in [1] to study the unicity and existence of certain optimal quadrature formulae. The results and proofs will be presented in the next section.

2. OPTIMAL QUADRATURE FORMULAE. Let $s = s(t^N) \in S_N^2$, $w \in C^2[0,1]$, and $e = w - s$. If $f \in C^2[0,1]$ then we can define the linear functional R by

$$R(f) = \int_0^1 f''[w-s] = \int_0^1 f'' e.$$

If we integrate this formula by parts twice, we obtain

$$(2.1)\quad R(f) = [f'e]_0^1 - [e'f]_0^1 - \sum_{i=1}^N A_i f(t_i) + \int_0^1 f w''$$

where $A_i = [s']_{t_i-}^{t_i+}$. That is,

$$R(f) = \int_0^1 f w'' - \sum_{i=0}^1 \sum_{j=0}^1 C_{ij} f^{(j)}(i) - \sum_{i=1}^N A_i f(t_i),$$

and hence we think of R as an error functional for a quadrature formula with weight w''. The interested reader should see [4] and the references therein. This quadrature formula is clearly exact for linear polynomials so that e could be thought of as the Peano kernel for the quadrature error.

Schoenberg [4] has coined the phrase optimal quadrature formula in reference to that formula which has minimal $L_2$ Peano kernel with a fixed number of knots and, perhaps, some boundary conditions built in. For instance, one might require e(0) = e(1) = 0 so that f' does not appear in the formula or one might require e(0) = e(1) = e'(0) = e'(1) = 0 yielding what Schoenberg calls an "open" formula.

*This research was supported by the U. S. Army Research Office under Grant No. DAHC 04-75-G-0186.

Notice that minimizing the $L_2$ norm of e over all possible $s \in S_N^2$ corresponds to minimizing

$$(2.2)\quad \sup\Big\{|R(f)| : \int_0^1 (f'')^2 \le 1\Big\},$$

and hence could be thought of as choosing the best points to evaluate f in order to approximate $\int f w''$. If w'' is positive this, of course, means that w is convex which allows us to use the results of [2] or [3]. We will sketch the proof of the following theorem.

Theorem 1. Let w'' > 0 and log w'' be concave on [0,1]. Then for any of the boundary conditions

i) e(0) and e(1) unrestricted,

ii) e(0) = e(1) = 0, or

iii) e(0) = e(1) = e'(0) = e'(1) = 0,

there is a unique optimal quadrature formula of the type (2.1) minimizing (2.2).

Since minimizing (2.2) is equivalent to approximating w from $S_N^2$ the results of [1] or [2] immediately imply that there is a unique best approximant to w from $S_N^2$ and hence a unique optimal quadrature formula. Part (i) follows directly from the results of [1]. In (ii) a few results must be established since the arguments in [1] don't carry over directly. Hereafter $\bar{S}_N^2$ will denote $\{s \in S_N^2 : s(0) = s(1) = 0\}$. The conditions that w'(0) and w'(1) are finite ensure that w has a best approximant from $\bar{S}_N^2$, if we assume w(0) = w(1) = 0. Then w(t) < 0 on (0,1), and it is easily seen that the first and last linear segment of any best approximant must lie in the triangle formed by the x-axis and the lines tangent to w at t=0 and t=1. The fact that the knots are simple follows from an argument in [3].

We next show that if s* is a best approximant to w from $\bar{S}_N^2$, with knot sequence $0 = t_0 < t_1 < \dots < t_N < t_{N+1} = 1$, then the error e = w - s* is orthogonal to linear functions between $t_i$ and $t_{i+1}$ for $1 \le i \le N-1$ and on the intervals $[0,t_1]$ and $[t_N,1]$, w - s* is orthogonal to the linear functions which vanish at t = 0 and t = 1 respectively. We establish our claims by first noting that the dimension of $\bar{S}_N^2$ is equal to N and that every spline in $\bar{S}_N^2$ is of the form

$$s(x) = \sum_{i=1}^N A_i (x-t_i)_+ - \sum_{i=1}^N A_i (1-t_i)\,x.$$

Define

$$G(A_1,\dots,A_N, t_1,\dots,t_N) = \int_0^1 [w-s]^2\,dx$$

where s is any spline in $\bar{S}_N^2$. Suppose s* is the best approximant to w having the form

$$s^*(x) = \sum_{i=1}^N A_i (x-t_i)_+ - \sum_{i=1}^N A_i (1-t_i)\,x.$$

By the usual variational arguments one now may easily establish the orthogonality claims. Finally it may be shown that Lemma 3 of [1] goes through under our assumptions on w. Thus there is a unique optimal quadrature formula of the type (2.1) with the additional conditions e(0) = e(1) = 0.

Part (iii) may now be derived using slight modifications of the arguments employed for (ii).

We wish to turn our attention to eventually unique optimal quadrature formulae. By this we mean that for fixed w and some integer $N_0$, the best approximants $s_N^* \in S_N^2$ to w are unique for $N > N_0$. We now may state

Theorem 2. Let w'' > 0 on [0,1]. Then for any of the boundary conditions

(i) e(0) and e(1) unrestricted,

(ii) e(0) = e(1) = 0, or

(iii) e(0) = e(1) = e'(0) = e'(1) = 0,

there are eventually unique optimal quadrature formulae of the type (2.1) minimizing (2.2).

We sketch a proof of this theorem. As in Theorem 1, (i) follows from the results of [1] and (iii) will follow by arguments similar to those used in establishing (ii). The key to this is in showing that Proposition 1 of [1] is applicable to our situation.
Proposition 1. Let $A = (a_{ij})$ be a tridiagonal N x N real matrix with positive diagonal entries. Then if

$$(2.2)\quad a_{n,n-1}\,a_{n-1,n} \le a_{n,n}\,a_{n-1,n-1}\,(1 + \pi^2/4N^2)/4$$

for n = 2,...,N it follows that det A > 0.
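A small Python sketch can make the proposition concrete. It assumes the hypothesis reads $a_{n,n-1}a_{n-1,n} \le a_{n,n}a_{n-1,n-1}(1+\pi^2/4N^2)/4$, which is a reading of a damaged formula in the source, and it evaluates the determinant by the standard three-term recurrence for tridiagonal matrices. The function names are hypothetical.

```python
import math

def tridiagonal_det(diag, lower, upper):
    """Determinant via d_n = a_nn * d_{n-1} - a_{n,n-1} * a_{n-1,n} * d_{n-2}."""
    d_prev, d = 1.0, float(diag[0])
    for n in range(1, len(diag)):
        d_prev, d = d, diag[n] * d - lower[n - 1] * upper[n - 1] * d_prev
    return d

def satisfies_hypothesis(diag, lower, upper):
    """Check positive diagonal and the assumed off-diagonal product bound."""
    size = len(diag)
    bound = (1.0 + math.pi**2 / (4.0 * size**2)) / 4.0
    if any(a <= 0 for a in diag):
        return False
    return all(lower[n - 1] * upper[n - 1] <= diag[n] * diag[n - 1] * bound
               for n in range(1, size))

# The second-difference matrix (2 on the diagonal, -1 beside it), N = 5.
diag, off = [2.0] * 5, [-1.0] * 4
print(satisfies_hypothesis(diag, off, off), tridiagonal_det(diag, off, off))
```

For this matrix the recurrence gives the determinant N + 1, which is positive as the proposition predicts.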

Hereafter we use the notation of section 3 in [1]. Our aim is to show that the determinant of $J(F(t^N))$ is positive where $J(F(t^N))$ refers to the Jacobian matrix of $F(t^N)$. The only thing to be checked is that equation (2.2) is valid in the case n = 2 or n = N for it is only here that our equations differ from those in [1]. We restrict our attention to n = 2. For sufficiently smooth w, a Taylor expansion shows that $w''(\Delta t) = K + O(\Delta t)$ where $\Delta t$ goes to zero as the number of knots gets large. Thus, for the eventual uniqueness problem we may assume $w = x^2$. Hence,

fl 2 


(2.3) 

(2.4) 


- >2  f1  od,  - <a,  >2  r iitoi  .2dT 


qu  ) - (At0 


F2(t")  = (At,)' 


f 1 


12 


T(l-T) 


2 

- *2dr  - (At0) ’ 


6 

T(l-T) 

6 


- -2di 


Easy  computations  lead  to 


3F, 

3t. 


-4  At, 


1 s2 

T(l-T) 


^ = 4(Atl)  j"  dT  + 4(At2)  dT  f 


JF 

5T  = -4(Ati} 


1 2 
KH)  , T 
— r — dT  . 
o 6 


i!i . «(At0)  f sijp.  d,  <■  UAL ) [‘  ift-j r 


dT 


1 


Now 


f1  (1-D 2 

d t = 1/72 

Jo  6 
and


t(1-t  )dT  = 1/48 
12 


so  that 


a21  = “ Ati/18  » 

a12  = ’ 
At  + At2 

""  18  : 


a22 


11 


and  thus  a^  = 2At^/18,  a^  _> 


11/5  At 


18 


At()  Atl 

12  18  ’ 

we  get  At2  = At^ 

Ati 

and  so  a2lai2^a 

11  22 

= *'3/2  AtQ 
< 1/22/5  < 1/4 


so  that  the  proposition  applies  and  the  determinant  is  positive.  The  rest 
of  the  proof  in  [1]  applies  directly  and  thus  our  assertion  is  proved. 
Finally we consider the case where

$S_N^{2k}$ = {splines of order 2k with N interior knots each with multiplicity 2k-1}.

Here, our remainder formula has the form

$$(2.5)\quad R(f) = \int_0^1 f^{(2k)} e = \int_0^1 f^{(2k)}(w-s)$$

where here our weight is assumed to be in $C^{2k}[0,1]$ and $s \in S_N^{2k}$. Upon integrating by parts 2k times, one obtains

$$(2.6)\quad R(f) = B_0(f) + \dots + B_{2k-1}(f) + \sum_{i=1}^N \sum_{j=0}^{2k-2} (-1)^{j+1} A_{ij} f^{(j)}(t_i) + \int_0^1 w^{(2k)} f\,dt$$

where $B_\ell(f)$ involves boundary values of the $\ell$th derivative of f and $A_{ij} = [s^{(2k-1-j)}]_{t_i-}^{t_i+}$. For these type quadrature formulae and under assumptions analogous to those of Theorems 1 and 2 of this section we again obtain unique and eventually unique optimal quadrature formulae. That is

Theorem 3. Let $w^{(2k)} > 0$ and $\log w^{(2k)}$ concave on [0,1]. Then there is a unique optimal quadrature formula of type (2.6) minimizing $\sup\{|R(f)| : \int_0^1 [f^{(2k)}]^2 \le 1\}$. If no assumption is made on $\log w^{(2k)}$ but $w \in C^{(2k+1)}[0,1]$ then there is an eventually unique optimal quadrature formula of type (2.6).
NL9 - AN ADAPTIVE ROUTINE FOR NUMERICAL QUADRATURE

Arthur  Hausner 
Harry  Diamond  Laboratories 
Management  Information  Systems 
Adelphi,  Maryland  20783 

ABSTRACT. NL9 (Near Lobatto, 9-point) is an adaptive routine that provides an estimate of the definite integral

$$y = \int_a^b f(x)\,dx$$

to within an absolute or relative error supplied by the user. The basic scheme compares successive estimates of subintegrals after applying 1-, 3-, and 5-point near-Lobatto quadrature formulas. These formulas differ from true Lobatto formulas in that the transformed near end points of ±0.9999 are used instead of ±1 for function evaluations, in order to accommodate integrands with end-point singularities. If the estimate proves unsatisfactory, an optimum 9-point Gaussian formula is applied, using the previous 5 points. If the integral estimate is still unsatisfactory, interval bisection and queue-stacking is implemented, with the basic scheme applied to the new subintervals. After all subintervals in any one cycle are processed, an Aitken $\delta^2$ transformation attempts to accelerate the sequence of estimates obtained by successive cycles. This procedure is very effective for increasing efficiency in problems with end-point singularities, without affecting reliability. The queue is currently set for 16 subintervals; when the queue is filled, NL9 switches to a 96-point Gauss-Legendre quadrature formula with queue-stacking and cycle acceleration. This was found to increase efficiency for highly oscillatory integrands without affecting reliability.

NL9 has been tested with a wide variety of integrands and error specifications, and has been found to be an efficient, reliable routine, suitable for general-purpose applications by unsophisticated users. Unlike its predecessor, FOGIE, NL9 handles discontinuous integrands with good reliability and efficiency. In addition, the efficiency of NL9 has been improved overall (the minimum number of function evaluations is 3) at only a slight expense in reliability.

1. INTRODUCTION. NL9 (Near-Lobatto, 9-point) is an adaptive routine written in FORTRAN that provides an estimate of the definite integral

$$y = \int_a^b f(x)\,dx$$

to within an absolute or relative error supplied by the user. It was written to overcome two main deficiencies of its predecessor, FOGIE [1]:

a.  Unreliability  for  discontinuous  integrands. 

b.  Inefficiency  for  low  accuracies  because  the  minimum  number 
of  function  evaluations  was  24. 

Of course, the large minimum number of function evaluations resulted in FOGIE being an extremely reliable code, except for discontinuous integrands. For use as a general-purpose quadrature routine for unsophisticated users, however, it is desirable to have a routine that will return correct estimates with high probability for all types of integrands, and do it in an efficient way, i.e., with a comparatively small number of function evaluations. The main reason for the unreliability of FOGIE for discontinuous integrands was that no function evaluations were made at the end points of any subinterval. To overcome this for a FOGIE-like routine, Lobatto formulas should be used, whereby function evaluations are required at ±1 of the transformed interval (the end points). However, if this were done, some integrands with end-point singularities, which blow up at the end points, would have to be specially treated. It was therefore decided at the outset that near-Lobatto formulas would be used, requiring function evaluations at ±0.9999 of the transformed interval. The loss in reliability due to this change is negligible [1].

To overcome the second disadvantage of FOGIE, a different scheme was implemented for removing a subinterval from the queue of subintervals to be processed. As will be seen, this resulted in the possibility of removing a subinterval from the queue after only three function evaluations. Since the initial interval is the first subinterval, the minimum number of function evaluations was reduced to three. This reduction does decrease reliability, but not to the extent that one may suppose. The major difficulty is in bypassing a narrow peak that contributes to the integral.

This paper discusses various features of NL9 and the tests that were made, leading to the conclusion that it is a very reliable and highly efficient routine for numerical quadrature, particularly suited as a general-purpose routine for unsophisticated users.

2. THE BASIC SCHEME. The basic scheme of NL9 can be conveniently broken into four sections best understood by first describing the calling sequence of the routine (REAL*4 is single precision and REAL*8 is double precision):

      CALL NL9 (FCT, XL, XU, ACC, INTVL, LIMIT, Y, ERROR, NFUN, IER)

FCT     the dummy name of the subroutine to compute f(x).
XL      the lower limit of integration (real*8).
XU      the upper limit of integration (real*8).
ACC     the accuracy desired (real*4).
        If ACC>0, ACC is used for the absolute error.
        If ACC<0, -ACC is used for the relative error.
INTVL   the number of equal intervals into which (XL,XU) is initially
        divided. The integral of each interval is found separately and
        the results are summed.
        If INTVL<0, -INTVL is the subdivision number and a 96-point
        Gauss-Legendre formula is used as the method of integral
        estimation.
LIMIT   the maximum number of function evaluations allowed.
Y       the returned integral estimate (real*8).
ERROR   if positive, the returned magnitude of the estimate of error
        (real*4). If negative, -ERROR is the magnitude of error estimate
        and the value Y was obtained by acceleration.
NFUN    the number of function evaluations.
IER     the output error code.
        =1   the answer is thought to be within the accuracy specified.
        =2   the error is thought to exceed that which was specified
             by ACC.
             the run was aborted because of exceeding LIMIT.
             the run was aborted because the queue was filled while
             using the 96-point Gaussian formulas.

2.1 The Initial Phase. We assume INTVL=1, so that near-Lobatto formulas will be applied to the entire interval in the initial phase. The interval is transformed to (-1,+1) for simplicity in applying the formulas. Estimates are then obtained generally by using

$$I_n = \sum_{i=1}^{n} w_i f(x_i).$$

For n=3, we use

    x_1   = 0,          w_1   = 1.3331999799973329995
    x_2,3 = ±0.9999,    w_2,3 = 0.33340001000133350002.

The estimate $I_3$ is exact for polynomials to degree 3. $I_1$ is compared to $I_3$ (see section 3) and the process continues if the criteria for stopping are not satisfied.


$I_5$ is next computed from

    x_1   = 0,                          w_1   = 0.71103996442309754558
    x_2,3 = ±0.9999,                    w_2,3 = 0.10009004501475245572
    x_4,5 = ±0.65458817259184819938,    w_4,5 = 0.54438997277369877148

which computes integrals of polynomials to degree 7 exactly. $I_3$ is compared to $I_5$ (see section 3) and the process continues if the criteria for stopping are not satisfied.


$I_9$ is then computed with

(6)
    x_1   = 0,                          w_1   = 0.34572776695800701539
    x_2,3 = ±0.9999,                    w_2,3 = 0.030740824922769000059
    x_4,5 = ±0.65458817259184819938,    w_4,5 = 0.28395029242094264548
    x_6,7 = ±0.89031636893046580368,    w_6,7 = 0.17924470129867993407
    x_8,9 = ±0.34094819564600667896,    w_8,9 = 0.33420029787860491270

which computes exact estimates of integrals of polynomials to degree 13. $I_5$ and $I_9$ are compared (see section 3) and the process continues if the criteria for stopping are still unsatisfied.

In the above, only the 3- and 5-point estimates are near Lobatto. The 9-point estimate uses the previous five function evaluations.

Throughout the initial phase, a check is made for $f(x_{2i}) = f(x_{2i+1})$ for $1 \le i \le 4$. If all checks are true to 11 digits or more, the function f(x) is flagged symmetric (even), and it is therefore assumed that f(x) = f(-x), x being the coordinate in the transformed interval. This results in a savings of half of the function evaluations thereafter. The number of function evaluations saved, $N_s$, can always be computed from the returned NFUN:

(7)  Ns  = /[  NFUN-9  J -2  x ^NLUN'l_9  jj  (NFUN-9) . 


2.2 Queue Stacking. If the initial phase does not result in exiting from the routine, the 9-point estimate and its upper and lower bounds are placed in a circular stacking queue. Subintervals on this queue are processed by subdividing them into two equal intervals and applying to each half the basic scheme described in 2.1. It is thus possible that none, one, or both halves of a subinterval are placed back on the queue, with the original one being removed. Additional tests compare the sum of the estimates of each half with that of the whole. If any subinterval passes an accuracy test, the estimate of its integral is accumulated in the variable SUMY in this routine, and is not placed back on the queue.

If the queue is empty, SUMY is returned as the final estimate Y. Otherwise, the process continues until LIMIT is reached, the queue is filled, or estimates are found from the acceleration procedure.

2.3 Acceleration Procedure. The routine knows when an entire set of subintervals on the queue has been processed, i.e., the set has been subdivided and new estimates have been found. This is called the end of a cycle. The sum of the new estimates plus SUMY constitutes a new total estimate of the integral at the end of a cycle. These estimates form a sequence which is amenable to acceleration. The identical acceleration algorithm as described for FOGIE [1] is used here, consisting primarily of repeated applications of Aitken's $\delta$-squared transformation. Such a procedure has been found to be very useful for accelerating these sequences to their limits for cases that have end-point singularities. If the acceleration criteria are satisfied, NL9 returns with an answer even if intervals still remain on the queue.
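One application of Aitken's $\delta$-squared transformation to a sequence of end-of-cycle estimates can be sketched as follows. This is the textbook form of the transformation, not the FOGIE or NL9 code; the function name `aitken_delta2` and the choice to keep the untransformed value when the second difference vanishes are assumptions made for this illustration.

```python
def aitken_delta2(seq):
    """Return the Aitken-accelerated sequence; it is two terms shorter than seq."""
    out = []
    for j in range(2, len(seq)):
        first_diff = seq[j] - seq[j - 1]
        second_diff = seq[j] - 2.0 * seq[j - 1] + seq[j - 2]
        if second_diff == 0.0:
            out.append(seq[j])          # nothing to extrapolate
        else:
            out.append(seq[j] - first_diff * first_diff / second_diff)
    return out

# A sequence converging geometrically to 1 is mapped onto its limit.
estimates = [1.0 + 0.5**n for n in range(1, 7)]
accelerated = aitken_delta2(estimates)
print(accelerated)
```

Repeated acceleration, as described in the text, amounts to calling the function again on its own output while at least three terms remain.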

2.4 Switching to a 96-point Gaussian Formula. If the queue becomes filled (16 entries), it is assumed that the integral contains high frequencies, so that the low-order formulas used are not efficient. Contiguous subintervals are combined and replaced on the queue and, thereafter, only a 96-point Gauss-Legendre formula is applied in all cases. The best estimate obtained with the 9-point formulas is stored for comparison with that of the 96-point formula at the end of the next cycle. If acceleration is performed, it is with estimates obtained only with 96-point formulas.

If the queue becomes filled while the 96-point formula is being used, NL9 is aborted. Either the integral does not exist, or such high frequencies are present that INTVL<-1 should be used.

3.  CRITERIA  FOR  REMOVAL  FROM  QUEUE.  Four  possible  tests  are  used  to 
determine  whether  a subinterval  should  be  removed  from  the  queue: 

1.  If  |l-^-Ij|  <_  1 15|*10-15  or  (bypassed  when  NFUN=3,  if  13=0 

or  there  was  underflow) 


I9-I5I  i | I9I *10"*5 


2.  For  NFUN>9,  and  with  Y the  best  estimate  to  date, 

I ^ 3 i i ACC/10  ' (ACC>0)  or  (bypassed  if  the  function  has 

never  oscillated) 

(9)  I i 5 I or  U9I  1 ACC  | Y j / 1 0 2 

I I 3 I 1 - ACC  * | Y | / 1 0 4 (ACC  <0)  or 
I 1 5 i or  ! I g | < - ACC* | Y | / 1 0 3 

5.  If  1 1,-1.  | < ACC/EK 

, b (ACOO) 
or  j 1 5” I y J < ACC/EK 

(10'  or 

If  II3-I5I  < -ACC I Y I /EK 

i,  , , ~ , (ACC<0) 

or  | I5-I9 I 1 -ACC|Y|/EK 

where 

EK  = “if  ) 1 3- 1 s | > 0 . 1 1 1 s | or  |l5-Iy|  >0. 1 ( Ig ( 

UJ  tK  = 100  if  |I5-I5(>10-3|I5|  or  1 15- lg | > 10"  ■ 3 | Ig | 

EK  = 10  if  |I3-I5|>10-5|IS|  or  | I 5 - I g | > 1 0" 5 | I g | 

EK  = 1 if  10-5|I5|>|I3-I5|  or  10- 5 ( Igl > | Ifi- Ig | 

Thus,  this  condition  depends  on  the  relative  error  of  successive  estimates. 

4. When each component is tested after subdivision for tests 1, 2, and 3, the sum of 9-point estimates for the left- and right-hand sides forms an 18-point estimate $I_{18}$. This estimate is then compared to the 9-point estimate of the original subinterval. An identical test to 3 is made, with $I_{18}$ replacing $I_9$ where it appears in both (10) and (11).

These tests are performed as the various quantities $I_1$, $I_3$, $I_5$, $I_9$, and $I_{18}$ are found. If any subinterval is removed from the queue, of course, subsequent tests are bypassed.

If  a 96-point  formula  is  used,  tests  1 and  4 are  made  with  EK  = 10. 

These  tests  determine  whether  or  not  any  subinterval  is  put  back  in 
queue.  If  the  queue  becomes  empty,  control  is  returned  to  the  calling 
program. 

4.  END-OF-CYCLE TESTS.  At the end of each cycle of subdividing,
the best estimate $K_n$ is compared to the best estimate $K_{n-1}$ of
the previous cycle. If

(13)    $10\,|K_n - K_{n-1}| \le \mathrm{ACC}$             if ACC > 0

     or $10\,|K_n - K_{n-1}| \le -\mathrm{ACC}\,|K_n|$     if ACC < 0

(provided $|K_n - K_{n-1}| \le 0.1\,|K_n|$, if no oscillations are
present), then $K_n$ is adjudged to pass the test and is returned as the
answer. If it does not pass the test, the sequence $K_j$, j=1,n, is
transformed into a new sequence $k_j$, j=3,n, by means of the Aitken
$\delta^2$ transformation as in FOGIE [1], and the test (13) is applied
to the values $k_n$ and $k_{n-1}$. Repeated transformations are obtained
and repeated tests are made as long as there are at least three elements
left after the first transformation. The acceleration is bypassed when
n<4 initially, or if the difference sequence is not monotonically
decreasing in magnitude. In the latter case, only the two best values
are retained, with $K_2$ set to $K_n$ and $K_1$ to $K_{n-1}$, and a new
sequence is begun. Acceleration may then be attempted after two more
cycles of subdividing.
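
The end-of-cycle logic described above can be sketched as follows. This is a minimal illustration only, not the NL9 source: the function names are hypothetical, the Aitken formula is the textbook one, and the side condition on oscillations attached to test (13) is omitted.

```python
def aitken_delta2(seq):
    """Textbook Aitken delta-squared transform of a list of estimates.

    Element j of the result (j >= 2, zero-based) is
        seq[j] - (seq[j] - seq[j-1])**2 / (seq[j] - 2*seq[j-1] + seq[j-2]),
    so an input of length n yields n - 2 transformed values.
    """
    out = []
    for j in range(2, len(seq)):
        d1 = seq[j] - seq[j - 1]
        d0 = seq[j - 1] - seq[j - 2]
        denom = d1 - d0
        # Assumption: if the second difference vanishes, keep the raw value.
        out.append(seq[j] if denom == 0.0 else seq[j] - d1 * d1 / denom)
    return out


def differences_decrease(seq):
    """True if successive differences strictly decrease in magnitude."""
    diffs = [abs(seq[j] - seq[j - 1]) for j in range(1, len(seq))]
    return all(b < a for a, b in zip(diffs, diffs[1:]))


def passes_test_13(k_new, k_old, acc):
    """Test (13): absolute tolerance if acc > 0, relative if acc < 0."""
    tol = acc if acc > 0 else -acc * abs(k_new)
    return 10.0 * abs(k_new - k_old) <= tol


# A geometrically converging sequence K_j = 1 + 0.5**j (limit 1).
K = [1.0 + 0.5 ** j for j in range(1, 6)]
if len(K) >= 4 and differences_decrease(K):
    T = aitken_delta2(K)
    accepted = passes_test_13(T[-1], T[-2], 1.0e-6)
```

For a purely geometric error the transform removes the error term entirely, which is why the transformed values pass (13) while the raw estimates do not.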

5.  ERROR ESTIMATES.  Each time a subinterval is removed from the
queue, it introduces an error component; the sum of the squares of these
components is accumulated. The magnitude of each component is adjusted
empirically, depending upon the test by which the subinterval was
removed from the queue. For test number 1, the magnitude is taken to be
the left-hand quantities of (8) multiplied by 10. For other tests, the
multiplying factor is EK. The final error is taken to be the square root
of the accumulated sum of squares of these components, modified in an
attempt to prevent underestimating the error. If its value already
exceeds ACC (or -ACC*|Y| if ACC<0), IER is returned with the value 2.
Otherwise, the error estimate is reset to

(14)    ERROR = MIN(20*ERROR, ACC)

Although empirical, the estimate worked very well in almost all cases
tested. The actual value is not too important. Keeping track of its
value internally permits the return of IER = 2 if the user requests
an absurd accuracy. Because test number 1 will ensure an empty queue
eventually, no matter what accuracy is specified, the main use of
ERROR is to alert the user to these situations.
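
The bookkeeping described in this section might look roughly like the sketch below. It is not the NL9 code: the class and method names are hypothetical, the unspecified empirical "modification" of the root-sum-square is left out, and using -ACC*|Y| in place of ACC in (14) when ACC<0 is an assumption.

```python
import math


class ErrorAccumulator:
    """Root-sum-square accumulation of per-subinterval error components."""

    def __init__(self):
        self.sum_sq = 0.0

    def add(self, component, factor):
        # factor is 10 for removals by test 1 and EK for the other tests.
        self.sum_sq += (factor * component) ** 2

    def final(self, acc, y):
        """Return (error_estimate, accuracy_exceeded)."""
        error = math.sqrt(self.sum_sq)
        # Assumption: the relative case uses -acc*|y| as the tolerance.
        tol = acc if acc > 0 else -acc * abs(y)
        if error > tol:
            return error, True          # corresponds to IER = 2
        return min(20.0 * error, tol), False   # equation (14)


acc_est = ErrorAccumulator()
acc_est.add(3.0e-9, 10.0)   # removed by test 1
acc_est.add(4.0e-9, 10.0)   # removed by a test with EK = 10
estimate, exceeded = acc_est.final(1.0e-6, 1.0)
```

Inflating the estimate by a factor of 20 but capping it at the requested tolerance keeps the reported error conservative without ever reporting more than the user asked for on a successful return.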

6.  TEST  AND  RESULTS.  A wide  variety  of  test  integrals  were  used 
to  tune  the  various  constants  required.  Some  of  those  used,  in  addition, 
were  selected  because  test  results  with  them  have  been  published,  thus 
providing  some  means  of  comparing  NL9  with  other  routines.  All  tests 
used INTVL=1 and LIMIT=10000.

The  tests  consisted  of  three  groups  of  specific  integrals: 

1.  21 integrals used by Kahaner [2]. These were tested with

ACC  = ^10-3,  j+10"6,  j+10-9,  +10"  . Comparisons  are  available 

with some other routines for ACC = $10^{-3}$, $10^{-6}$, and $10^{-9}$.

2.  50 integrals used by Casaletto, Picket, and Rice, and published
in [3]. These were tested for ACC = $\pm 10^{-1}$ through
$\pm 10^{-9}$. Comparisons are available only with the routine FOGIE.

3.  15 integrals mentioned specifically by other papers in the
field ([4] through [6]), or included just to see what happened.
They are listed in Appendix A, and were tested for ACC = $\pm 10^{-1}$
through $\pm 10^{-9}$.

In addition to these tests, some parameter sweeping tests, described by
de Boor [5], were made for ACC = $\pm 10^{-6}$:

n i 

(a)  \ x“  dx  for  a = -.li(-J-)  2 

d0  32  32 


^ ~2-2a^2  dx  for  a = 0(1)100 

(c)  ^ [l+C0S(aTTX)  ] dx  for  a = 0(i)^ 


(d)  $\int_0^1 (-\ln x)^{\alpha}\,dx$  for $\alpha$ = -0.99(0.01)2

(e)  $\int_0^1 U(\alpha - x)\,dx$  for $\alpha$ = 0.01(0.01)1

Tests a to c can be compared to CADRE for ACC = $10^{-6}$, and tests a
through d can be compared to FOGIE for ACC = $10^{-6}$.

Results  for  the  three  groups  of  specific  integrals  are  summarized 
in  Tables  1 through  5. 

For Kahaner's test integrals, failures occurred only with case 21,
which caused failures for most routines Kahaner tested. Comparisons
can be made with CADRE (easily the "winner" of the Romberg extrapolation
routines with results of this test published) for ACC = $10^{-3}$,
$10^{-6}$, and $10^{-9}$, and with FOGIE for those same accuracies.
Counting a failure for each routine for the same case as a tie, and
taking the number of function evaluations as the measure of efficiency,
Table 6 summarizes the comparisons. The relaxation of conditions for
queue removal for NL9 permitted three failures as compared to zero for
FOGIE, but the efficiency was markedly improved.

NL9 is comparable to CADRE for the low accuracy ($10^{-3}$) but appears
more efficient for the higher accuracies ($10^{-6}$ and $10^{-9}$).
FOGIE still seems most efficient for the high accuracy of $10^{-9}$,
where the minimum number of function evaluations of 24 is not
detrimental.

Tables 2 and 3 show results for the test integrals of Casaletto,
Picket, and Rice. Complete failure occurs for case 47, where a very
narrow discontinuous band is superimposed on a low-degree polynomial.
Most routines fail on this case because no points are sampled in the
discontinuous region early enough. (FOGIE does sample points in this
region, but fails on some of these cases because of its unreliability
for discontinuous cases.) It is clear, however, from the qualitative
description of the results of those tests, that NL9 is vastly superior
to the Clenshaw-Curtis algorithm described by Gentleman [3]. There
were many failures as well as many cases (end-point singularities and
discontinuities) which could not be computed by that algorithm in 10000
function evaluations. The maximum required by NL9 was 2103. Further,
their statistical test on the "smooth" cases shows NL9 is more efficient
for accuracies greater than $10^{-3}$. No comparisons are available for
relative error tests, where failure still occurs with case 47. Also,
as expected, case 50 always returned IER = 2. Since the answer is 0,
no result can meet this requirement. This is a failure in a sense,
but was not so indicated.

Few comparisons are available with the test integrals in Group 3,
summarized in Tables 4 and 5. The only failure with absolute error
occurs with case 11, ACC = $10^{-9}$. The internal point of singularity
is actually hit exactly (64 bits) and causes overflow after many
interval bisections. The result shown was obtained with f(x) = 0 at
x = 0.2 to prevent overflow. For the relative error runs (Table 5),
this case also presents difficulty. IER = 4 is returned, indicating
a filled queue. Case 6 had many runs returned with IER = 2, although
the accuracy was met. This results from loss of digits due to summing
nearly equal positive and negative terms. One can expect IER = 2 in
such cases when relative error is used. The overall results of these
test cases indicate both excellent reliability and efficiency.

Figures 1 through 10 show the test results for cases where parameters
were swept. The graphs are paired for each case -- one showing
the number of function evaluations and one the true error. Only results
for ACC = $10^{-6}$ are given because they can be compared, in part,
to both FOGIE and CADRE. Results for ACC = $-10^{-6}$ will be described
if they are significantly different. Downward or upward arrows indicate
cases out of range of the scale.

Figures 1 and 2 are for case a, the powers of x. No failures were
obtained in the entire range (for relative errors also), with the
average number of function evaluations slightly less than FOGIE and
CADRE. (CADRE fails for some cases.) All of the cases where NFUN ≥ 63
returned successfully as the result of the acceleration process. This
case can be considered completely successful.

Figures 3 and 4 are for case b, the large central peak. Again,
there were no failures, and the true error was about constant. The
number of function evaluations is about one-third that of FOGIE and
slightly less than CADRE (CADRE fails for a>30). Some of that difference
results from the integrand being a symmetric function. The results for
relative error are similar, but NFUN was greater by more than a factor
of 2. This occurred because the large value of the peak is not apparent
until many cycles of subdividing and tends to keep intervals on the
queue longer than necessary. This case can also be considered completely
successful.

Figures 5 and 6 are for case c, the oscillatory integrand. There
were no failures, with the absolute and relative error tests giving
identical results, a reasonable outcome since the answer is always near
the value 1. Results are somewhat superior to FOGIE, because of the
switch to the 96-point Gauss-Legendre formulas for a filled queue for
the larger frequencies. They are vastly superior to CADRE, which fails
in many cases because it concludes the integrand is linear. Note that
NFUN was quite small for many values of large a. For integer values of
a, the integrand can become antisymmetric after some subdivision. NL9
recognizes this situation and stops more quickly than it would for
nonintegers near a. This case also can be considered completely
successful.
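
The remark that the answer is always near 1 can be made explicit. The worked formula below assumes the interval of integration for case c is (0,1), which is not legible in the listing of the sweep tests above; $\alpha$ denotes the swept parameter a.

```latex
\int_0^1 \bigl[\,1 + \cos(\alpha \pi x)\,\bigr]\,dx
   \;=\; 1 + \frac{\sin(\alpha\pi)}{\alpha\pi}
   \qquad (\alpha \neq 0),
\qquad\text{so}\qquad
\Bigl|\int_0^1 \bigl[\,1 + \cos(\alpha \pi x)\,\bigr]\,dx - 1\Bigr|
   \;\le\; \frac{1}{\alpha\pi}.
```

Under that assumption the value is exactly 1 for every nonzero integer $\alpha$, and an absolute tolerance and a relative tolerance of the same size impose nearly the same requirement once $\alpha$ is large.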

Figures 7 and 8 show results for case d, the nonalgebraic end-point
singularity. This case has singularities at both ends when a<0 and
is considerably more difficult to compute numerically. The true error
graph indicates 12 failures near a = -1 (FOGIE had 2), with the arrows
marking runs that returned IER = 4. In the region of failure the correct
values of the integrals are greater than 10. Some of these failures
had as many as six correct digits. Only eight failures occurred with
the relative error specifications, all with IER = 1. At least four
digits were correctly obtained by the acceleration process in those
runs. This case, while only partially successful, is considered
satisfactory because of the difficult nature of the integral.
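
The size of the exact answers near a = -1 can be checked independently. The sketch below assumes case d is the integral of (-ln x)^a over (0,1); the substitution x = exp(-t) then gives the gamma function value Gamma(a+1). The function name is hypothetical and the snippet is only a sanity check, not part of NL9.

```python
import math


def exact_log_power_integral(alpha):
    """Integral over (0,1) of (-ln x)**alpha dx, valid for alpha > -1.

    With x = exp(-t) the integral becomes the integral over (0, inf) of
    t**alpha * exp(-t) dt, which is Gamma(alpha + 1).
    """
    return math.gamma(alpha + 1.0)


# Tabulate a few swept values; the result grows rapidly as alpha -> -1.
for alpha in (-0.99, -0.95, -0.90, -0.50, 0.0, 1.0, 2.0):
    value = exact_log_power_integral(alpha)
    print(f"alpha = {alpha:5.2f}   exact integral = {value:10.4f}")
```

Under this assumption the exact value exceeds 10 only for parameters within roughly a tenth of -1, which is consistent with the failures clustering at that end of the sweep.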

Figures 9 and 10 are for case e, the jump discontinuity. No failures
were obtained, indicating good reliability. No comparisons with
other routines are available, but from other cases run, it is evident
that the efficiency is not as good as CADRE. Still, the purpose of using
near-Lobatto formulas was to improve reliability. It is doubtful that
routines based on Gaussian-type formulas can be as efficient as routines
using Romberg extrapolation for detecting jump discontinuities.

7.  CONCLUSIONS.  From the results of tests described here, we
conclude that NL9 is an effective routine for computing numerical
quadrature. Overall reliability is excellent for all types of integrands
encountered in practice. Overall efficiency is quite good, especially
for problems with end-point singularities and high accuracy
requirements. These assessments suggest that NL9 is suitable as a
general-purpose quadrature routine that will satisfy most users.

Ultimately, a routine with keywords -- similar to the control program
designed with EISPACK2 [7] -- might be desirable for unsophisticated
users. Information about the integrand, such as "smooth," "singularity,"
"oscillatory," "discontinuous," and "peak" can be useful in selecting a
specific routine to get good efficiency. For example, a "smooth"
integrand is probably best attacked by Patterson-like formulas [4],
whereas a routine for one with end-point singularities should contain
the capability of subdivision and acceleration. NL9 might represent a
procedure to be used if no keywords are supplied, i.e., if the nature
of the integrand is unknown. There is then an excellent chance that the
answer is correctly obtained in a reasonably efficient manner.

Figure 1.  Function evaluations for sweep of powers of x.

Figure 2.  True error for sweep of powers of x.

Figure 3.  Function evaluations for sweep of central peak.

Figure 4.  True error for sweep of central peak.

Figure 6.  True error for sweep of oscillatory integrand.

Figure 7.  Function evaluations for sweep of nonalgebraic end-point singularity.


TABLE  1 


RESULTS 


Value  Shown  is  retur 
All 

All  true  errors  < es 


Case    ACC = 10^-3    10^-6    10^-9

  1          9            9       27
  2        129          249      369
  3         51           63       99
  4          9            9       18
  5         18           36       36
  6          9           45       81
  7        109           63       99
  8          9           27       63
  9        235          387      951
 10          9            9       27
 11          5            9       27
 12          5            9        9
 13        375          567      567
 14         87          141      155
 15         75          129      171
 16        149          207      351
 17        335          567      567
 18        135          135      279
 19         59           63       99
 20         18           36       36
 21        145F         225F     347F

389 


KAHANER'S  INTEGRALS 


-IO' 3 

-10-*' 

-io-4 

-10" 

27 

9 

9 

9 

27 

*89 

141 

261 

369 

489 

99 

51 

63 

99 

99 

18 

9 

9 

18 

18 

72 

18 

36 

36 

72 

99 

23 

63 

99 

99 

315 

441 

63 

99 

351 

63 

9 

27 

63 

63 

951 

279 

459 

951 

951 

63 

9 

27 

27 

63 

27 

5 

9 

27 

27 

27 

5 

9 

9 

27 

567 

567 

567 

567 

951 

231 

89 

153 

261 

269 

297 

77 

147 

317 

437 

459 

153 

297 

351 

675 

951 

375 

567 

567 

951 

375 

135 

207 

315 

375 

135 

163 

63 

99 

171 

72 

18 

18 

36 

72 

901 

171F 

297F 

905 

1593 

TABLE 2.  RESULTS WITH ABSOLUTE ERROR FOR INTEGRALS OF CASALETTO,
PICKET, AND RICE

All IER=1.  Value shown is NFUN.  F indicates Failure.
All true errors < estimated errors except for Failures.


Case   ACC = 10^-1  10^-2  10^-3  10^-4  10^-5  10^-6  10^-7  10^-8  10^-9

  1            3      3      3      3      3      3      3      3      3
  2            3      3      3      3      3      3      3      3      3
  3            5      5      5      5      5      5      5      5      5
  4            5      5      5      5      5      5      5      5      5
  5            9      9      9      9      9      9      9      9      9
  6            5      9      9      9      9      9      9      9      9
  7            9      9      9      9      9      9      9      9      9
  8            9      9      9      9      9      9      9      9      9
  9            9      9      9      9     27     27     27     27     27
 10            9      9      9      9     27     27     27     27     27
 11            9      9     27     27     27     27     27     27     27
 12            9      9     27     27     27     27     27     27     27
 13            9      9     27     27     27     27     27     27     27
 14            5      9     27     27     27     27     27     27     27
 15            9      9     27     27     27     27     27     27     63
 16            5     27     27     27     27     27     27     27     63
 17            9     27     27     27     27     27     27     63     63
 18            5     27     27     27     27     27     27     63     63
 19            9     27     27     27     27     27     63     63
 20            5     27     27     27     27     27     63     63     63
 21            5      5      9      9      9      9      9      9     27
 22            9      9      9      9      9     18     18     18     18
 23            5      5      9      9      9      9      9      9      9
 24            5      5      5      9      9      9      9      9      9
 25            9      9      9      9      9     27     27     27     27
 26          135    163    235    279    379    387    459    951    951
 27            9      9      9     27     27     27     27     63     63
 28            5      5      5      9      9      9      9      9     27
 29          453    375    375    375    567    567    567    567    567
 30          777    951    951    951    951    951    951    951    951
 31          541    567    567   1335   1335   1335   1335   1335   1719
 32            5      5      9      9      9      9      9     18     18
 33           18     18     18     18     18     36     36     36     36
 34          567    567    567    567   1719   2103   1719   1719   1719
 35            9      9      9      9      9      9     27     27     27
 36           23     23     51     59    109     63     63     63     99
 37           23     37     51     59     63     63     63     63     99
 38           23     37     51     55    109     63     63     63     99
 39           23     37     51     55    109     63     63     63     99
 40           55     55     83    115    135    135    135    135    207
 41            9      9      9     27     27     45     63     63     81
 42           19     27     27     55     63     95    135    135    135
 43            9      9      9      9     27     27     45     45     63
 44            9     23     23     27     27     45     63     63     99
 45           15     15     15     15     15     15     15     15     15
 46            3      3      3      3      3      3      3      3      3
 47            5F     5F     5F     5F     5F     5F     5F     5F     5F
 48           35     49     89    171    199    263    289    337    365
 49            9      9      9      9      9      9      9      9      9
 50            3      3      3      3      3      3      3      3      3
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APPENDIX  A 


GROUP  3 TEST  INTEGRALS 


CASE   INTEGRAND                                        END POINTS

  1    1/(1+x^2)                                        (-100, +100)
  2    f(x) = x-5 for 5<x<10, 15-x for 10<x<15,
              0 otherwise                               (0, 20)
  3    x sin (72*x)                                     (0, 1)
  4    sin^2(50 pi x) / [50 (pi x)^2]                   (0, 10)
  5    J1(x)                                            (0, 100)
  6    sin(1/x) / x^2                                   (0.00499999995, 1)
  7    1 / sqrt(|x^2 - 1|)                              (-1, 1)
  8    4 pi^2 x sin(20 pi x) cos(2 pi x)                (0, 1)
  9    1 / [1 + (230x - 30)^2]                          (0, 1)
 10    1/x^5                                            (0.01, 1.10)
 11    1 / sqrt(|x - .2|)                               (0, 1)
 12    x log sin x                                      (0, 1)
 13    log (1/x)                                        (0, 1)
 14    |x - .6|                                         (0, 1)
 15    x arc cos(1/x) sqrt(1 - (x-1)^2)                 (1, 2)
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IF ( Yt  .1  ! . l.P-b* YA | oL 1 < * 

«;9Jg*s  i: 

F K = 1 0 . 

0 J 3 C 4 3 c 3 

Gut;  io 

u J 3 C * 3 3 ) 

2* 

F K = l . 

o no*  7 *f  o 

GJfU  10 
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0.1  T.  1 / 

o : J U * b 9 3 

c 

GuNTkJt  F <»•  K-3. 

3 0 C 3*  ri  :o 

*D 

00  lu  119,1*1,  IT 
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* 7 31  c9<4  5 / I t ti  7 ) J * . 7 ‘f  c"i  t AF  I 7 f j 2 J | , 7 3»  3 K :>*  l o < 7 r F /it  F 

* 7 3V  3 1jC  5f  ? A a A F i A 7 7 , / U j»  /i  Foc^iAl  *y '»  yj  , 31  / t il  t L rj  73L 
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c 
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c 
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T 
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f STIMATF. 
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c 
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FILE  NO  - l 
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c 
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M.U 1 lf». 
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: jjcs  /s: 

C 

] S » 1 = «.  will  It.  St  1 n-  1NOICAU  a aYFREIkIC 

F CM 

III  . b *■ ' R THE 
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c 

i-IRST  INTERVAL, 

J3S JSi I J 

1 S « 1 = 1 

c 

SE  I UP  SYMH.LlPlL  bCNCTluN  INDICATE*. 
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l 3w  1 = 2 
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c 
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• 
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F ILL l l«  l »=l  <L  ♦ <U) *.SUC 
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c 
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3 
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St)«Y  AclUKUl  AHS  I hi  liNllul'Al  cSll“»Hs  hC*  |\II  • V »l  S •'iwrvi,  •=  «i  .'3  3 305  U 3 
FH>  JUfcllt,  0330S72., 

su*7=o  .00  c o:Ci>jr. 

II  lb  Tit  INI'!  X F'JS  l. 

u = i :o  cctcii  c 

I’fUIib  nfw  rmr.  lYClt  i-a-ihs  I hi  i n.>  it  a ail  i:f  imlrvalS.  jjKibt: 

crcit  * ^iiiM  ooccss/o 

SUhU  AllGMUiAlfS  Tht  HlllO-iaL  ISflt'afiS  FCR  INItKVALS  PL4CIL  MLK  C00C64I33 
ON  T He  1,'UfUr.  300oS490 

SU*Q  ^ 0.00  C0CC6CCC 

EVALUATE  NEW  INTERVAL  lilt  Tht  SIACN.  fl-bl  i.HcCK  II-  I.UKJ'  IS  Ml  Ltl.t)0JG631C 

if(nf  lit  .re  .til'll  tii  ii  o :o:o602j 

LPT  = LFF!  CiiU  POINT  . FPI  = 'I  I OH  T cM'  FLINT,  E = ESIlPArr  cF  CCC160JO 

liMHOKftl  BETWEEN  LPT  xpr,  C3006C4G 

LPT  = F ILtlUl  END,  II  0 0306CS3 

RPT  = F 11  F ( F 1 1 t M),  21  30006360 

t = F III  IF  IlFfJU,  31  GCCC60T0 

UPT  A T F INTf'RVAlS  ON  Til.  „i.ftj|£.  333C6JEJ 

NF  I L t =Hl  III  -i  vtPjCOCSG 

UPDATE  F ILF  NO  PUMTFP.  OJJO6IC1. 

IF  It  ILFNO.t  L.LtNI  F ILLNi'sJ  333061  10 

F IL1  NU=F  ILLNHt  I 00006  120 

CHECK.  FOR  ALJUSIPEM  Oti.-  U SYPPIIAIL  FUNC  1 I >’N  THE-  I I, .SI  TIPI  33306133 

THRIUGH.  3-J3614; 

iMIShl.CU.il  t-OTO  lb  00CC61SC 

P l U=  RP  T 3 3 3Ctlt3 

RARE  A=C.OC  33336  1 70 

GuTl-  1 7 3 30J6  ISO 

MID-ILPTfPPTI*.  600  33336143 

CllPPjTt  THt  TF  ANlbHJkP4 1 I IN  Fl.»»'l  Ii  Jj  FOR  TFr  I FI  IMt-vAL.  C0UC6233 

A=  1ML-IPT  l*.6C'3  333C6213 
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RE  St  I 3 BEFiFL  RETURNING.  0J0C62FC 

IFIISW.Fy.2l  J-l  00006243 

GUTt1  71  33306  333 

K 1 = 2 HFAN'S  THt  LLf  r U.IERVAl  will  rr  RFPCVtl.  IMF  TFt  QUEUE.  C30C631C 
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COMPUTE  THt  TF  ANSFDRP4T  II  i\  pafAFEIFpS  Ft“  TFl  tllll  l.bltPvAl.  03  306  340 

A=  ISPT-MI(3l*.5UU  33306  3b3 
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*2=2  NT  AN  b THE  RIGHT  INTI  J VA  t WILL  Ft  fFPlVEL)  FeIv  THE  CoFUl.  3 1 3364  13 
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CALL  REMOVE  1 YE  ST, Y»2 ( I 21 , YE .tPKCh .tRK.K, ILKI  33306443 
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IMILK.iU.tl  (,(  r.,?  sc 
n l = 2 

K ' = ? 
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SO  IHM.U.ll  (»( < I J bb 
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PJT  LFM  INTfPVAl  HACK  PE  hit  CUPLr. 
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FILL  IPO  INT  , 2l  = Ml> 
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ON  THE  OPTIMALITY  OF  THE  RAYLEIGH -RITZ  APPROXIMATION 


S.  C.  Eisenstat,  R.  S.  Schreiber,  M.  H.  Schultz* 

Department  of  Computer  Science 
Yale  University 
New  Haven,  Connecticut  06520 


Abstract

Let A be a densely defined, self-adjoint positive definite linear
operator on a Hilbert space H. Let u be the generalized solution of
the equation Au = f, where $f \in H$, and let $u_S$ be the Rayleigh-Ritz
approximation to u from a finite-dimensional subspace S of the domain
of A. We show that if the residual is bounded, i.e., there exists
B > 0 such that, for every $g \in H$,

        $\| g - A v_S \| \le B \| g \|$,

where $v_S$ is the Rayleigh-Ritz approximation to the generalized
solution of Av = g, then the Rayleigh-Ritz approximation is
quasi-optimal in the sense that

        $\| u - u_S \| \le B \inf_{v_S \in S} \| u - v_S \|$.

For A a uniformly elliptic second-order differential operator and
$\{S_h\}_{h>0}$ a suitable family of subspaces of the domain of A, we
show that the residuals are bounded uniformly in h, so that the
Rayleigh-Ritz approximations are uniformly quasi-optimal. In the case
of a two-point boundary value problem, we remove the restriction that
$S_h$ be contained in the domain of A, and show that the Rayleigh-Ritz
approximation is as good as any interpolate in $S_h$ of the solution.



1.  Introduction.

Let A be a densely defined, self-adjoint positive definite linear
operator on a Hilbert space H, and let u be the generalized solution
to the equation Au = f, where $f \in H$. Nitsche's trick (Nitsche [4])
has been widely used to prove optimal-order H-norm error bounds for
the Rayleigh-Ritz method for approximating u (see Schultz [5] and the
references therein). Our aim here is to establish a stronger result
on the optimality of the Rayleigh-Ritz approximation.


*This research was supported in part by NSF Grant MCS 76-11450 and ONR
Grant N00014-76-0277.

Let $u_S$ be the Rayleigh-Ritz approximation to u in a finite-
dimensional subspace S of the domain of A. In Section 2 we prove
that the Rayleigh-Ritz approximation is quasi-optimal (in the H norm):

(1.2)   $\| u - u_S \| \le B \inf_{v_S \in S} \| u - v_S \|$,

provided there exists a positive constant B such that, for every
$g \in H$,

        $\| g - A v_S \| \le B \| g \|$,

where $v_S$ is the Rayleigh-Ritz approximation to the generalized
solution of Av = g. In Section 3 we show that if A is a uniformly
elliptic second-order differential operator and $\{S_h\}_{h>0}$ is a
suitable family of subspaces of the domain of A, then the residuals are
bounded uniformly in h. Therefore, the Rayleigh-Ritz approximations are
quasi-optimal uniformly in h. In particular, the result applies to
subspaces of $C^1$-piecewise polynomials. In Section 4 we consider
self-adjoint two-point boundary value problems. Without requiring
that $S_h$ be a subspace of the domain of A, we show that the error in
the Rayleigh-Ritz approximation is within a constant factor of the
error in any interpolate in $S_h$ of the solution.



2.  H-norm error bounds.

Let H be a Hilbert space with inner product $(\cdot,\cdot)$ and norm
$\|\cdot\|$, and let A be a linear operator on H such that

i)   A is densely defined, i.e., the domain $D(A)$
     of A is dense in H, and

ii)  A is self-adjoint and positive definite.

We define the inner product

        $[u,v] = (Au,v)$,  $u,v \in D(A)$,

and the corresponding norm

        $\| v \|_A = [v,v]^{1/2} = (Av,v)^{1/2}$,  $v \in D(A)$.

The completion $H_A$ of $D(A)$ with respect to $\|\cdot\|_A$ is itself a
Hilbert space with inner product $[\cdot,\cdot]$. Furthermore,

(2.1)   $[v,u] = (Av,u)$,  $v \in D(A)$, $u \in H_A$

(see Mikhlin [3]).

Let $f \in H$ and consider the linear equation

(2.2)   $Au = f$.

The generalized solution of (2.2) is the unique $u \in H_A$ satisfying

(2.3)   $[u,v] = (f,v)$  for all $v \in H_A$.

For every finite-dimensional subspace S of $H_A$, the Rayleigh-Ritz
approximation $u_S$ to u is the unique element of S satisfying

(2.4)   $[u - u_S, v_S] = (f - Au_S, v_S) = 0$  for all $v_S \in S$.


We now give a general condition on the subspace S which is
sufficient to make the Rayleigh-Ritz approximation quasi-optimal.

Theorem 1.  Let S be a finite-dimensional subspace of $D(A)$ with the
property that the residual in the Rayleigh-Ritz approximation is
bounded; i.e., there exists a constant B > 0 such that, for every
$g \in H$, the Rayleigh-Ritz approximation $v_S$ to the generalized
solution of Av = g satisfies

(2.5)   $\| g - Av_S \| \le B \| g \|$.


Then the Rayleigh-Ritz approximation is quasi-optimal in the H-norm in
the sense that


(2.6)   $\| u - u_S \| \le B \inf_{v_S \in S} \| u - v_S \|$.


Proof:  Following Nitsche [4], let $e_S = u - u_S$, let $w \in H_A$ be
the generalized solution of (2.2) with right-hand side $e_S$, and let
$w_S \in S$ be the Rayleigh-Ritz approximation to w. By (2.4)

        $[w_S, e_S] = 0$,

so that

(2.7)   $\| e_S \|^2 = (e_S, e_S) = [w, e_S] = [w - w_S, e_S]$.

At this point in the argument, earlier authors applied the Cauchy-
Schwarz inequality and used approximation-theoretic results to bound
$\| w - w_S \|_A$ and $\| e_S \|_A$ (cf. [4], [5]). Instead, we
continue, letting $\tilde{u}_S$ denote an arbitrary element of S.
By (2.4)

(2.8)   $[w - w_S, u_S - \tilde{u}_S] = 0$.

Adding (2.8) to (2.7),

(2.9)   $\| e_S \|^2 = [w - w_S, u - \tilde{u}_S]$
                     $= [w, u - \tilde{u}_S] - [w_S, u - \tilde{u}_S]$.

By (2.3),

        $[w,v] = (e_S, v)$  for all $v \in H_A$;

and since $S \subset D(A)$, we have by (2.1) that

        $[w_S, v] = (Aw_S, v)$  for all $v \in H_A$.

Substituting these equalities in the right-hand side of (2.9),

        $\| e_S \|^2 = (e_S - Aw_S, u - \tilde{u}_S)$.

Thus, by the Cauchy-Schwarz inequality and (2.5), we have that

        $\| e_S \|^2 \le \| e_S - Aw_S \| \, \| u - \tilde{u}_S \|$
                     $\le B \| e_S \| \, \| u - \tilde{u}_S \|$.

Since $\tilde{u}_S$ was arbitrary, this proves (2.6).

Q.E.D.
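
Theorem 1 can be checked numerically in the simplest finite-dimensional setting. The sketch below is only an illustration under stated assumptions: H is taken to be the plane with the Euclidean inner product, A is a symmetric positive definite 2-by-2 matrix (so every subspace lies in the domain of A), S is spanned by a single vector s, and B is computed as the operator norm of the residual map. All function names are hypothetical.

```python
import math


def matvec(M, x):
    return [M[0][0] * x[0] + M[0][1] * x[1], M[1][0] * x[0] + M[1][1] * x[1]]


def dot(x, y):
    return x[0] * y[0] + x[1] * y[1]


def norm(x):
    return math.sqrt(dot(x, x))


def solve_2x2(A, f):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * f[0] - A[0][1] * f[1]) / det,
            (A[0][0] * f[1] - A[1][0] * f[0]) / det]


def rayleigh_ritz(A, s, g):
    """Element c*s of S with [c*s, s] = (g, s), where [u,v] = (Au,v)."""
    c = dot(s, g) / dot(s, matvec(A, s))
    return [c * s[0], c * s[1]]


def residual_bound(A, s):
    """Smallest B with ||g - A v_S|| <= B ||g|| for all g (spectral norm)."""
    As = matvec(A, s)
    sigma = dot(s, As)
    # Residual map g -> g - A v_S is the matrix M = I - (A s)(s^T)/sigma.
    M = [[(1.0 if i == j else 0.0) - As[i] * s[j] / sigma for j in range(2)]
         for i in range(2)]
    # Largest eigenvalue of M^T M, in closed form for a 2-by-2 matrix.
    a = M[0][0] ** 2 + M[1][0] ** 2
    b = M[0][0] * M[0][1] + M[1][0] * M[1][1]
    d = M[0][1] ** 2 + M[1][1] ** 2
    lam = 0.5 * (a + d) + math.sqrt(0.25 * (a - d) ** 2 + b * b)
    return math.sqrt(lam)


def check(A, s, f):
    """Return (||u - u_S||, B * inf over S of ||u - v_S||)."""
    u = solve_2x2(A, f)
    u_S = rayleigh_ritz(A, s, f)
    c = dot(s, u) / dot(s, s)          # best H-norm approximation from S
    best = [c * s[0], c * s[1]]
    err = norm([u[0] - u_S[0], u[1] - u_S[1]])
    best_err = norm([u[0] - best[0], u[1] - best[1]])
    return err, residual_bound(A, s) * best_err


err, bound = check([[2.0, 1.0], [1.0, 3.0]], [1.0, 0.0], [1.0, 1.0])
```

The two returned numbers are the left- and right-hand sides of (2.6), so the theorem predicts that the first never exceeds the second, up to rounding.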


3.  Strongly elliptic differential equations.

Let $\Omega$ be a bounded region in Euclidean n-space and consider the
differential equation

(3.1)   $Au \equiv -\sum_{i,j=1}^{n} \frac{\partial}{\partial x_i}
                 \Bigl( a_{ij} \frac{\partial u}{\partial x_j} \Bigr) + bu = f$,

subject to the Dirichlet boundary conditions

(3.2)   $u = 0$ on $\partial\Omega$.

Let $H^s = H^s(\Omega)$ (respectively $H^s_0$) be the completion of the
$C^\infty$ functions (respectively the $C^\infty$ functions with compact
support in $\Omega$) with respect to the norm

        $\| u \|_{H^s}^2 = \sum_{j=0}^{s} \int_\Omega (D^j u)^2 \, dx$,

and let $\|\cdot\|$ denote the norm of $H^0 = L^2(\Omega)$. We assume
that

i)    $a_{ij} = a_{ji}$,  $1 \le i,j \le n$,

ii)   $a_{ij} \in C^1(\Omega)$,  $b \in C^0(\Omega)$,

iii)  There exist positive constants $\lambda$ and $\Lambda$ such that

(3.3)   $\lambda \| u \|_{H^1} \le (Au,u)^{1/2} \le \Lambda \| u \|_{H^1}$

      for all $u \in H^2(\Omega) \cap H^1_0(\Omega)$.

With these hypotheses, $H_A = H^1_0(\Omega)$.

Let the generalized solution to (3.1)-(3.2) be defined as in
Section 2. We assume the following regularity result:

iv)   There exists a positive constant K such that, if
      $f \in L^2(\Omega)$, then the generalized solution u is
      in $H^2(\Omega)$ and

(3.4)   $\| u \|_{H^2} \le K \| f \|$.

(For conditions under which this hypothesis is valid, see Schultz [5,
p.103], Birman and Skvortsov [1], and Friedman [2, p.307].)

Let $\{S_h\}_{h>0}$ be a family of finite-dimensional subspaces
of $D(A)$ which satisfy the following "approximation hypothesis": There
exists a positive constant $c_1$ such that, for all $u \in H^2(\Omega)$
and all h > 0, there exists a $v_h \in S_h$ satisfying

(3.5)   $\| u - v_h \|_{H^s} \le c_1 h^{2-s} \| u \|_{H^2}$,  s = 0,1,2.

We require further that $S_h$ satisfy the "inverse assumption": There
exists a positive constant $c_2$ such that, for all h > 0 and
$v_h \in S_h$,

(3.6)   $\| v_h \|_{H^2} \le c_2 h^{-1} \| v_h \|_{H^1}$.

These assumptions will be satisfied, for example, when $S_h$ is a
subspace of $C^1$-piecewise polynomials of degree 2 with respect to a
uniform mesh of size h.

We now show that the residual in the Rayleigh-Ritz approximation
is uniformly bounded. We employ a technique of Stephens [6], who
proved this result for the special case of Poisson's equation.

Theorem  2.  If  the  operator  A satisfies  (i)-(iv)  and  the  suhspaces 

satisfy  the  approximation  hypothesis  (3.5)  and  the  inverse  assumption 

(3.6) ,  then  the  residuals  f - Au,  are  bounded  uniformly  in  h;  i.e., 

h 2 

there  exists  a constant  H independent  of  h such  that,  for  every  f e L 

and  all  h > 0,  the  Ravi  eip,h-Ri  tz  approximation  u,  e S satisfies 

h h 

(3.7)  II  f - Au,  ||  < R ||  f II  . 

h — 

Proof: Because of assumption (ii) and the smoothness of $S_h$, there exists a constant $M > 0$ such that

$\|Au_h\| \le M \|u_h\|_{H^2}$.

Thus, in order to bound $\|f - Au_h\|$, it suffices to bound $\|u - u_h\|_{H^2}$:

(3.8) $\|f - Au_h\| \le \|f\| + M \|u_h\|_{H^2}$
      $\le \|f\| + M \|u\|_{H^2} + M \|u - u_h\|_{H^2}$
      $\le (1 + MK) \|f\| + M \|u - u_h\|_{H^2}$,

where we have used the regularity assumption to bound $\|u\|_{H^2}$.

Let $v_h$ be an approximation to $u$ which satisfies (3.5). By the triangle inequality,

(3.9) $\|u - u_h\|_{H^2} \le \|u - v_h\|_{H^2} + \|v_h - u_h\|_{H^2}$.

Because the Rayleigh-Ritz approximation is optimal in the norm of $H_A$, it follows from (3.3) and the approximation hypothesis that

(3.10) $\|u - u_h\|_{H^1} \le \lambda^{-1} \|u - v_h\|_A$
       $\le \lambda^{-1}\Lambda \|u - v_h\|_{H^1}$
       $\le c_1 \lambda^{-1}\Lambda h \|u\|_{H^2}$.

Using the inverse assumption (3.6), (3.5) with $s = 1$, and (3.10),

(3.11) $\|v_h - u_h\|_{H^2} \le c_2 h^{-1} \|v_h - u_h\|_{H^1}$
       $\le c_2 h^{-1} ( \|v_h - u\|_{H^1} + \|u - u_h\|_{H^1} )$
       $\le c_2 c_1 (1 + \lambda^{-1}\Lambda) \|u\|_{H^2}$.

Now, using (3.5) with $s = 2$ and (3.11) to bound the right-hand side of (3.9), and applying the regularity hypothesis (3.4), we have

(3.12) $\|u - u_h\|_{H^2} \le c_1 [1 + c_2 (1 + \lambda^{-1}\Lambda)] \|u\|_{H^2}$
       $\le K c_1 [1 + c_2 (1 + \lambda^{-1}\Lambda)] \|f\|$.

Finally, (3.7) follows from (3.8) and (3.12).

Q.E.D.


Corollary. Under the hypotheses of the theorem, there exists a constant $B$ which depends only on the operator $A$ such that, for all $h > 0$,

$\|u - u_h\| \le B \inf_{v_h \in S_h} \|u - v_h\|$.


The result of the theorem also implies that the residual converges to 0.


Corollary. Under the hypotheses of the theorem, the functions $Au_h$ converge to $f$ in $L^2$; i.e.,

$\|Au_h - f\| \to 0$ as $h \to 0$.

Proof: Stephens [6, Corollary 2.1] shows that $Au_h$ converges to $f$ if and only if $\|Au_h\| \le B$ for some constant $B$ independent of $h$.

Other conditions for the convergence to zero (and hence boundedness) of the residual have been obtained. See, for example, Mikhlin [3, p. 147].

4. The one-dimensional case.

In this section we consider the case of a second-order self-adjoint elliptic differential equation on the unit interval $I = [0,1]$. We remove the requirement that the subspace $S_h$ belong to the domain of the operator, and show that the Rayleigh-Ritz approximation is asymptotically as good as any interpolate of the solution in $S_h$.


Consider the two-point boundary value problem

(4.1) $Au = -D(a(x)Du) + b(x)u = f$ in $(0,1)$,
      $u(0) = u(1) = 0$,

where $D = d/dx$,

i) $a \in C^1(I)$ satisfies $a(x) \ge \delta > 0$ for all $x \in I$, and

ii) $b \in C(I)$ satisfies $b(x) \ge 0$ for all $x \in I$.

Given $f \in L^2(I)$, the generalized solution $u \in H_0^1(I)$ is defined by

(4.2) $a(u,v) = \int_0^1 (a\,Du\,Dv + buv)\,dx = (f,v) = \int_0^1 fv\,dx$,

for all $v \in H_0^1(I)$. The form $a(u,v)$ is strongly coercive; i.e., there exists a constant $K$ such that, for every $f \in L^2$, the generalized solution $u$ is in $H^2$ and

$\|u\|_{H^2} \le K \|f\|$

(see Schultz [5, p. 103]).

Let $S_h$ be a finite-dimensional subspace of $H_0^1$ with the property that there exists a partition $\Delta: 0 = x_0 < \dots < x_N = 1$ such that

(4.3) $\|Av_h\|_\Delta^2 = \sum_{i=1}^{N} \int_{x_{i-1}}^{x_i} (Av_h)^2\,dx < \infty$

for all $v_h \in S_h$. This is essentially equivalent to requiring that $S_h$ consist of continuous, piecewise-$C^2$ functions which satisfy the boundary conditions. We require that $S_h$ satisfy a modified inverse assumption: There exists a positive constant $c_2$ such that, for all $h > 0$ and all $v_h \in S_h$,

(4.4) $\|v_h\|_{2,\Delta} \equiv \left( \sum_{i=1}^{N} \sum_{j=0}^{2} \int_{x_{i-1}}^{x_i} (D^j v_h)^2\,dx \right)^{1/2} \le c_2 h^{-1} \|v_h\|_{H^1}$.

We also require that $S_h$ satisfy a modified approximation hypothesis: There exists a positive constant $c_1$ such that, for all $u \in H^2$ and all $h > 0$, there exists a $v_h \in S_h$ satisfying

(4.5) $\|u - v_h\|_{H^s} \le c_1 h^{2-s} \|u\|_{H^2}$, $s = 0, 1$,

(4.6) $\|u - v_h\|_{2,\Delta} \le c_1 \|u\|_{H^2}$.

Finally we require that $S_h$ satisfy the following interpolation hypothesis: Given $d \in \mathbb{R}^{N-1}$ there exists an element $v_h \in S_h$ satisfying

(4.7) $v_h(x_i) = d_i$, $1 \le i \le N-1$.

For example, these hypotheses are satisfied when $S_h$ is a subspace of continuous piecewise polynomials of degree 1 with respect to a uniform mesh $\Delta$ of size $h$.
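
As an illustration of the piecewise-linear example, the following Python sketch computes the Rayleigh-Ritz approximation on a uniform mesh and compares its $L^2$ error with that of the nodal interpolant. The model problem ($a = 1$, $b = 1$, exact solution $u = \sin \pi x$), the function names, and the quadrature rules are assumptions made for this illustration only; they are not taken from the paper.

```python
import math

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[i] and upper[i] are the off-diagonal entries of row i."""
    n = len(diag)
    d, r = diag[:], rhs[:]
    for i in range(1, n):
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        r[i] -= m * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x

def rayleigh_ritz(f, n):
    """Nodal values of the piecewise-linear Rayleigh-Ritz approximation to
    -u'' + u = f, u(0) = u(1) = 0, on a uniform mesh with n cells."""
    h = 1.0 / n
    m = n - 1
    diag = [2.0 / h + 2.0 * h / 3.0] * m      # stiffness + mass, diagonal
    off = [-1.0 / h + h / 6.0] * m            # stiffness + mass, off-diagonal
    load = [0.0] * (n + 1)
    for k in range(n):                        # Simpson's rule on each cell
        a, mid, b = k * h, (k + 0.5) * h, (k + 1) * h
        load[k] += h / 6.0 * (f(a) + 2.0 * f(mid))
        load[k + 1] += h / 6.0 * (2.0 * f(mid) + f(b))
    return [0.0] + solve_tridiagonal(off, diag, off, load[1:n]) + [0.0]

def l2_error(u, nodal, n):
    """L2 norm of u minus the piecewise-linear function with the given nodal values."""
    h = 1.0 / n
    g = math.sqrt(0.6)
    rule = ((0.5 * (1 - g), 5 / 18), (0.5, 8 / 18), (0.5 * (1 + g), 5 / 18))
    total = 0.0
    for k in range(n):                        # 3-point Gauss rule on each cell
        for t, w in rule:
            vh = (1 - t) * nodal[k] + t * nodal[k + 1]
            total += w * h * (u((k + t) * h) - vh) ** 2
    return math.sqrt(total)

def u_exact(x):
    return math.sin(math.pi * x)

def f_rhs(x):
    return (math.pi ** 2 + 1.0) * math.sin(math.pi * x)

def compare(n):
    """Return (error of Rayleigh-Ritz approximation, error of nodal interpolant)."""
    h = 1.0 / n
    interp = [u_exact(i * h) for i in range(n + 1)]
    return l2_error(u_exact, rayleigh_ritz(f_rhs, n), n), l2_error(u_exact, interp, n)

if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, compare(n))
```

For this model problem the two errors should be of comparable size, and both should decrease roughly like $h^2$ as the mesh is refined.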

Theorem 3. Let $u$ be the generalized solution of (4.1) with $f \in L^2$. Let $S_h$ be a subspace of $H_0^1$ satisfying (4.3), the inverse assumption (4.4), the approximation hypothesis (4.5)-(4.6), and the interpolation hypothesis (4.7). Let $u_h \in S_h$ be the Rayleigh-Ritz approximation to $u$. There exists a constant $B$ independent of $h$, $f$, and $u$ such that

$\|u - u_h\| \le B \inf \{ \|u - v_h\| : v_h \in S_h,\ v_h \text{ interpolates } u \}$.

Proof: As in the proof of Theorem 1, let $e_h = u - u_h$, let $w$ be the generalized solution to (4.1) with right-hand side $e_h$, and let $w_h$ be the Rayleigh-Ritz approximation to $w$ in $S_h$. Let $v_h \in S_h$ interpolate $u$ at the knots of $\Delta$; i.e., $v_h$ satisfies (4.7) with $d_i = u(x_i)$. By the same argument that leads to (2.9), we obtain

(4.8) $\|e_h\|^2 = a(w, u - v_h) - a(w_h, u - v_h)$.

Integrating by parts,

$a(w_h, u - v_h) = \sum_{i=1}^{N} \int_{x_{i-1}}^{x_i} ( a\,Dw_h\,D(u - v_h) + b w_h (u - v_h) )\,dx$

$= \sum_{i=1}^{N} \left\{ \left[ a\,Dw_h\,(u - v_h) \right]_{x_{i-1}}^{x_i} + \int_{x_{i-1}}^{x_i} [-D(a\,Dw_h) + b w_h] (u - v_h)\,dx \right\}$.


But by the interpolation conditions (4.7), the integrated terms vanish. Thus by (4.8) and (4.2)

(4.9) $\|e_h\|^2 = a(w, u - v_h) - \sum_{i=1}^{N} \int_{x_{i-1}}^{x_i} [-D(a\,Dw_h) + b w_h] (u - v_h)\,dx$

      $= (e_h, u - v_h) - \sum_{i=1}^{N} \int_{x_{i-1}}^{x_i} Aw_h (u - v_h)\,dx$

      $= \sum_{i=1}^{N} \int_{x_{i-1}}^{x_i} (e_h - Aw_h)(u - v_h)\,dx$

      $\le \|e_h - Aw_h\|_\Delta \|u - v_h\|$.

We now bound the norm of the residual $\|e_h - Aw_h\|_\Delta$ by the norm of the right-hand side $\|e_h\|$. The proof is identical to that of Theorem 2, except that we use the $\|\cdot\|_{2,\Delta}$ norm instead of the $\|\cdot\|_{H^2}$ norm; we omit the details. Together with (4.9) this yields

$\|e_h\| \le B \|u - v_h\|$.

Since all we assumed about $v_h$ was that it interpolates $u$, the proof is finished.

Q.E.D.

NUMERICAL SOLUTION OF GUN TUBE PROBLEMS IN THE ELASTIC-PLASTIC RANGE


Peter  C.  T.  Chen 
Benet  Weapons  Laboratory 
Watervliet  Arsenal 
Watervliet,  New  York  12189 


ABSTRACT.  A finite  element  computer  program  for  treating  axisymmetric, 
elastic-plastic  boundary  value  problems  is  described.  Quadrilateral  ring 
elements  are  used.  The  material  is  assumed  to  obey  the  Mises  yield  criterion 
and  the  Prandtl-Reuss  flow  rule.  In  order  to  test  the  accuracy  of  the 
program  the  elastic-plastic  deformation  in  a pressurized  gun  tube  is 
investigated  and  the  results  are  compared  with  a finite-difference  solution. 
The  applicability  of  the  program  to  two-dimensional  axisymmetric  problems  is 
demonstrated  for  a gun  tube  of  finite  length  loaded  over  part  of  its  inner 
surface. 


1.  INTRODUCTION.  Of  all  the  available  elastic-plastic  solutions,  the 
problem  of  pressurized  thick-walled  tubes  has  received  the  greatest  attention. 
This  is  because  of  the  symmetric  nature  of  the  problem  and  its  practical 
importance  to  pressure  vessels  and  the  autofrettage  process  of  gun  barrels. 

Many investigations of the one-dimensional problem have been reported over the last two decades. However, some problems are still too difficult to be solved analytically (ref. 1). If we try to solve a two-dimensional elastic-plastic problem involving partial differential equations, the chance of success is even more remote. With recent developments in the finite element technique and in high-speed computers, numerical solutions to two- and three-dimensional elastic-plastic problems can now be obtained (refs. 2, 3, 4). Many good computer programs such as the MARC system have been developed, but they are not available for general distribution.

In  this  paper,  the  incremental  tangent-modulus  approach  of  the  finite 
element  formulation  together  with  the  computer  program  will  be  described. 

The  material  is  assumed  to  obey  the  Mises  yield  criterion  and  the  Prandtl- 
Reuss  flow  rule.  Quadrilateral  ring  elements  will  be  used  to  solve  one 
dimensional  as  well  as  two  dimensional  gun  tube  problems  in  the  elastic- 
plastic  range.  The  one  dimensional  elastic-perfectly-plastic  tube  problem 
was  solved  for  the  purpose  of  evaluating  the  convergence  and  accuracy  of  the 
present  program.  The  two  dimensional  elastic-plastic  strain-hardening  tube 
was  solved  for  demonstrating  the  applicability  of  the  program.  Typical 
results are shown graphically. This approach is quite general. The elastic-plastic material can be strain hardening or non-hardening. The limitation to the use of strain hardening material such as in NASTRAN (ref. 5) is not necessary.

2. FINITE ELEMENT FORMULATION. The variational principle for elastic-plastic solid (ref. 6) can be used as a theoretical basis for the finite element formulation. In the absence of body forces and considering only quasi-static deformation, the principle states that among all admissible displacement-increment vectors, $\{\Delta u\}$, the actual one renders the following functional stationary

$\phi = \frac{1}{2} \int_{V} \{\Delta\sigma\}^T \{\Delta\varepsilon\}\,dv - \int_{S_f} \{\Delta u\}^T \{\Delta f\}\,ds$   (1)

where $\{\Delta\sigma\}$ is the stress-increment vector; $\{\Delta\varepsilon\}$, the strain-increment vector; $\{\Delta u\}$, the displacement-increment vector; and $\{\Delta f\}$, the traction-increment vector prescribed over a portion of the boundary $S_f$. The superscript $T$ denotes the transpose. The stress-increment vector is related to the strain-increment vector which is derivable from an admissible displacement-increment vector.

An elastic-plastic body is divided into $L$ elements interconnected at a finite number of $N$ points. If $\phi^{(\ell)}$ is the functional (1) with respect to the $\ell$th element, then

$\phi = \sum_{\ell=1}^{L} \phi^{(\ell)}$   (2)

The elemental functional $\phi^{(\ell)}$ can be evaluated approximately in the following manner. First, an approximation is made concerning the incremental displacement, $\{\Delta u\}$, within an individual finite element in terms of discrete quantities at nodal points, $\{\Delta U\}$. This relation can be expressed in the form

$\{\Delta u\} = [N]\{\Delta U\}$   (3)

in which the components of $[N]$ are in general functions of position and $\{\Delta u\}$ satisfy the compatibility conditions within and along the boundary of adjacent elements.

The strain-increment vector, $\{\Delta\varepsilon\}$, is derivable from the assumed displacement-increment vector, $\{\Delta u\}$, and can be expressed in the form

$\{\Delta\varepsilon\} = [B]\{\Delta U\}$   (4)

where $[B]$ is a kinematical matrix.

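The incremental stress-strain relations for an elastic-plastic solid which obeys Mises' yield condition, Prandtl-Reuss flow rule and isotropic hardening rule can be expressed in closed form (ref. 2):

$\{\Delta\sigma\} = [D]\{\Delta\varepsilon\}$   (5)

In the two-dimensional axisymmetric case, the constitutive matrix $[D]$ can be written as

$[D] = [D^E] - [D^P]$   (6)

with

$[D^E] = \frac{2G}{1-2\nu} \begin{bmatrix} 1-\nu & \nu & \nu & 0 \\ & 1-\nu & \nu & 0 \\ & & 1-\nu & 0 \\ \text{SYM.} & & & \frac{1}{2}(1-2\nu) \end{bmatrix}$   (7)

and

$[D^P] = \frac{2G}{S_0} \begin{bmatrix} \sigma_r'^2 & \sigma_r'\sigma_\theta' & \sigma_r'\sigma_z' & \sigma_r'\tau_{rz} \\ & \sigma_\theta'^2 & \sigma_\theta'\sigma_z' & \sigma_\theta'\tau_{rz} \\ & & \sigma_z'^2 & \sigma_z'\tau_{rz} \\ \text{SYM.} & & & \tau_{rz}^2 \end{bmatrix}$   (8)

where

$\sigma_m = \frac{1}{3}(\sigma_r + \sigma_\theta + \sigma_z)$, $\sigma_r' = \sigma_r - \sigma_m$, etc.,

$S_0 = \frac{2}{3} \bar\sigma^2 \left[ 1 + \frac{2}{3}(1+\nu)\,\omega/(1-\omega) \right]$,

$\bar\sigma^2 = \frac{1}{2} \left[ (\sigma_r - \sigma_\theta)^2 + (\sigma_\theta - \sigma_z)^2 + (\sigma_z - \sigma_r)^2 + 6\tau_{rz}^2 \right]$,

$\omega = E_t/E$, $G = \frac{1}{2} E/(1+\nu)$,

and $E_t$ is the tangent-modulus, i.e. the slope of the effective stress-strain curve obtained from a tension test or from a torsion test, as shown approximately in Fig. 2.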
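
The assembly of the constitutive matrix in equation (6) can be sketched in a few lines of Python. This is a minimal sketch and not the author's program: the function name, the argument order, and the component ordering (radial, tangential, axial, shear) are assumptions made for illustration, and the stress state is assumed to be on the yield surface with $\omega < 1$.

```python
def constitutive_matrix(stress, E, nu, omega):
    """Tangent matrix [D] = [DE] - [DP] for the axisymmetric case.

    stress = (sigma_r, sigma_theta, sigma_z, tau_rz); omega = Et / E (must be < 1).
    """
    G = 0.5 * E / (1.0 + nu)
    c = 2.0 * G / (1.0 - 2.0 * nu)
    # Elastic part [DE]; the last diagonal entry c * (1 - 2 nu) / 2 equals G.
    DE = [
        [c * (1.0 - nu), c * nu, c * nu, 0.0],
        [c * nu, c * (1.0 - nu), c * nu, 0.0],
        [c * nu, c * nu, c * (1.0 - nu), 0.0],
        [0.0, 0.0, 0.0, G],
    ]
    sr, st, sz, t = stress
    sm = (sr + st + sz) / 3.0
    dev = [sr - sm, st - sm, sz - sm, t]      # deviatoric stresses and shear
    sbar2 = 0.5 * ((sr - st) ** 2 + (st - sz) ** 2 + (sz - sr) ** 2 + 6.0 * t * t)
    S0 = (2.0 / 3.0) * sbar2 * (1.0 + (2.0 / 3.0) * (1.0 + nu) * omega / (1.0 - omega))
    # Plastic part [DP] = (2G / S0) * dev dev^T, subtracted entry by entry.
    return [[DE[i][j] - 2.0 * G / S0 * dev[i] * dev[j] for j in range(4)]
            for i in range(4)]

if __name__ == "__main__":
    D = constitutive_matrix((-1.0e5, 1.2e5, 3.0e4, 2.0e4), 3.0e7, 0.3, 0.05)
    for row in D:
        print(row)
```

A useful consistency check, which follows from the formulas above, is that for a non-hardening material ($\omega = 0$) the matrix $[D]$ maps the strain increment $(\sigma_r', \sigma_\theta', \sigma_z', 2\tau_{rz})$ to zero.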


The elemental functional $\phi^{(\ell)}$, upon substitution of equations (3), (4) and (5) into the functional (1), becomes

$\phi^{(\ell)} = \frac{1}{2} \{\Delta U\}^T [K]^{(\ell)} \{\Delta U\} - \{\Delta U\}^T \{\Delta F\}^{(\ell)}$   (9)

where

$[K]^{(\ell)} = \int_{v} [B]^T [D] [B]\,dv$   (10)

is the element stiffness matrix and

$\{\Delta F\}^{(\ell)} = \int_{s_f} [N]^T \{\Delta f\}\,ds_f$   (11)

is the nodal point force-increment vector over the subregion. The global stiffness matrix $[K]$, and the nodal force-increment vector, $\{\Delta Q\}$, are the sums of those of the subregions. The necessary condition for the functional to assume a stationary value gives the following stiffness equation

$[K]\{\Delta q\} = \{\Delta Q\}$   (12)

where $\{\Delta q\}$ is the global displacement-increment vector at all nodal points.


3. QUADRILATERAL RING ELEMENT. Consider a quadrilateral ring element which consists of four triangular ring elements, as shown in Figure 1. The coordinates of a fictitious nodal point 5 are chosen to be the average of those of nodal points 1, 2, 3 and 4. By assuming a linear distribution of the displacement-increments within each triangular ring element and determining the coefficients such that the distribution passes through specified nodal points, the stiffness matrix for each triangular ring element is first obtained (ref. 4). To formulate the stiffness matrix of the quadrilateral ring element, the nodal force-increment vectors for four triangular ring elements are assembled, and the resulting equations are partitioned in the form

$\begin{bmatrix} [K_{aa}] & [K_{ab}] \\ [K_{ba}] & [K_{bb}] \end{bmatrix} \begin{Bmatrix} \{\Delta U_a\} \\ \{\Delta U_b\} \end{Bmatrix} = \begin{Bmatrix} \{\Delta F_a\} \\ \{\Delta F_b\} \end{Bmatrix}$   (13)

where

$\{\Delta U_a\}^T = [\Delta U_{11}, \Delta U_{12}, \Delta U_{21}, \Delta U_{22}, \Delta U_{31}, \Delta U_{32}, \Delta U_{41}, \Delta U_{42}]$,

$\{\Delta F_a\}^T = [\Delta F_{11}, \Delta F_{12}, \Delta F_{21}, \Delta F_{22}, \Delta F_{31}, \Delta F_{32}, \Delta F_{41}, \Delta F_{42}]$,

$\{\Delta U_b\}^T = [\Delta U_{51}, \Delta U_{52}]$, and $\{\Delta F_b\}^T = [\Delta F_{51}, \Delta F_{52}]$.   (14)

Eliminating the nodal point variables $\{\Delta U_b\}$ from equation (13), the stiffness matrix $[K]^{Q}$ of the quadrilateral ring element becomes

$[K]^{Q} = [K_{aa}] - [K_{ab}] [K_{bb}]^{-1} [K_{ba}]$   (15)

where $[K_{bb}]^{-1}$ denotes the inverse of the matrix $[K_{bb}]$. After determining the nodal displacement-increment vectors of the element, the strain-increment vectors in each triangular ring can be calculated, and the average of the strain-increment values of four triangular rings evaluated at nodal point 5 is assigned as the value of the strain-increment vector of the quadrilateral ring element.
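
The elimination that leads to equation (15) is a static condensation of the two unknowns at the fictitious node. The Python sketch below illustrates it for a generic partitioned system; the function name is hypothetical, and the condensed force vector $\{\Delta F_a\} - [K_{ab}][K_{bb}]^{-1}\{\Delta F_b\}$ is the standard companion of (15), which the text does not write out.

```python
def condense(K, F):
    """Eliminate the last two unknowns of K U = F by static condensation.

    Returns (Kq, Fq) with Kq = Kaa - Kab Kbb^{-1} Kba and
    Fq = Fa - Kab Kbb^{-1} Fb, where block b holds the last two unknowns.
    """
    na = len(K) - 2
    Kab = [row[na:] for row in K[:na]]
    Kba = [row[:na] for row in K[na:]]
    p, q = K[na][na], K[na][na + 1]
    r, s = K[na + 1][na], K[na + 1][na + 1]
    det = p * s - q * r                       # Kbb is 2 x 2, inverted explicitly
    Kbb_inv = [[s / det, -q / det], [-r / det, p / det]]
    # T = Kab * Kbb^{-1}, an na x 2 matrix
    T = [[sum(Kab[i][k] * Kbb_inv[k][j] for k in range(2)) for j in range(2)]
         for i in range(na)]
    Kq = [[K[i][j] - sum(T[i][k] * Kba[k][j] for k in range(2)) for j in range(na)]
          for i in range(na)]
    Fq = [F[i] - sum(T[i][k] * F[na + k] for k in range(2)) for i in range(na)]
    return Kq, Fq

if __name__ == "__main__":
    K = [[4.0, 1.0, 0.0, 1.0],
         [1.0, 5.0, 2.0, 0.0],
         [0.0, 2.0, 6.0, 1.0],
         [1.0, 0.0, 1.0, 3.0]]
    F = [10.0, 17.0, 26.0, 16.0]              # equals K applied to (1, 2, 3, 4)
    print(condense(K, F))
```

For the quadrilateral ring element the matrix passed in would be the assembled 10 x 10 system of (13), and the result is the 8 x 8 matrix of (15).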

4.  COMPUTER  PROGRAM  AND  EVALUATION.  A digital  computer  program  for 
solving  axisymmetric  elastic-plastic  boundary  value  problems  was  developed 
on  the  basis  of  triangular  ring  elements  (ref.  4).  The  existing  finite 
element  program  has  been  modified  so  that  both  triangular  and  quadrilateral 
ring  elements  can  be  used  in  modelling  the  complete  structure.  The  sequence 
of  the  present  program  is  similar  to  that  of  (ref.  2)  for  plane  stress 
problems.  The  computer  used  is  IBM  360  Model  44.  The  overlay  feature  has 
been  utilized  for  reducing  the  core  storage  requirement.  The  load-increments 
can  be  prescribed  or  determined  by  scaling  to  cause  at  least  one  more 
element  to  become  yielded.  Another  feature  of  this  program  is  its  capability 
of  restarting.  This  enables  the  user  to  restart  a program  from  a point  of 
completion  of  a given  loading  sequence.  This  is  because  all  previous  results 
were  written  on  tape.  In  addition  to  its  restart  capability,  the  results  on 
tape  can  be  used  for  output  plotting. 
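
The scaled loading option mentioned above can be illustrated with a short Python sketch. It assumes, as one common way of doing this, that a trial load increment produces a stress increment in each elastic element and that the increment is then scaled by the smallest factor that brings some element exactly to yield under the Mises criterion. The function names and the data layout are hypothetical and are not taken from the program described here.

```python
import math

def effective_stress_sq(s):
    """Square of the Mises effective stress for s = (s_r, s_theta, s_z, tau_rz)."""
    sr, st, sz, t = s
    return 0.5 * ((sr - st) ** 2 + (st - sz) ** 2 + (sz - sr) ** 2 + 6.0 * t * t)

def yield_scale(stress, d_stress, yield_stress):
    """Positive factor r for which stress + r * d_stress reaches the yield stress.

    Returns None if the increment never brings the element to yield.
    """
    q0 = effective_stress_sq(stress)
    q1 = effective_stress_sq(d_stress)
    if q1 == 0.0:
        return None
    total = [a + b for a, b in zip(stress, d_stress)]
    # The effective stress squared is a quadratic form; recover its cross term.
    cross = 0.5 * (effective_stress_sq(total) - q0 - q1)
    disc = cross * cross - q1 * (q0 - yield_stress ** 2)
    if disc < 0.0:
        return None
    return (-cross + math.sqrt(disc)) / q1

def load_scale(elements, yield_stress):
    """Smallest scale factor over all elastic elements, given as (stress, d_stress) pairs."""
    factors = [yield_scale(s, ds, yield_stress) for s, ds in elements]
    factors = [r for r in factors if r is not None]
    return min(factors) if factors else None

if __name__ == "__main__":
    elements = [((0.0, 0.0, 100.0, 0.0), (0.0, 0.0, 50.0, 0.0)),
                ((0.0, 0.0, 50.0, 0.0), (0.0, 0.0, 50.0, 0.0))]
    print(load_scale(elements, 200.0))
```

In the uniaxial example the first element reaches a yield stress of 200 after twice the trial increment and the second after three times, so the increment is scaled by the smaller factor.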

In  order  to  evaluate  the  convergence  and  accuracy  of  the  computer 
program,  the  plane-strain  problem  of  elastic-plastic  thick-walled  tube  under 
uniform  internal  pressure  was  first  investigated.  The  tube  of  outside  radius 
2"  and  inside  radius  1"  has  been  divided  into  2,5,10,20  and  25  quadrilateral 
ring  elements,  respectively;  and  the  five-element  model  is  shown  in  Figure  3. 
The  numerical  results  showing  the  relation  between  internal  pressure  and 
inside  radial  displacement  are  given  in  Figure  4.  The  effect  of  reducing  the 
element  size  on  the  rate  of  convergence  is  quite  significant.  The  differences 
between  20  and  25  elements  are  negligible  and  too  small  to  be  shown  in  the 
figure. Both applied and scaled loading approaches have been used. The number of loading steps for the scaled loading approach is equal to the number of elements. To achieve the same accuracy with the applied loading approach, many more steps are needed. The results shown in Figure 4 are based on the scaled loading approach. Another way of evaluating this program is by comparing the present solution with the solution due to Hodge and White (ref. 7). That solution was based on the finite-difference method and the stress results were given for an elastic-perfectly-plastic tube with b/a = 2 and p/a = 1.5. The results by the finite element method for the radial, tangential and axial stress distribution are shown in Figures 5, 6 and 7, respectively, with b/a = 2.0 and p/a = 1.02, 1.26, 1.50, 1.74, 1.98.

By comparing the present results with those in (ref. 7) we can conclude that the 25-element model does converge correctly. This is another indication of the accuracy of the present approach.

5. A TWO-DIMENSIONAL TUBE. The formulation and computer program developed here can be used for analyzing an axisymmetric body of revolution under axisymmetric mechanical loads. The illustrated example is a two-dimensional elastic-plastic thick-walled tube problem as shown in Figure 8. The tube with inner radius (1"), outer radius (2") and length (4") is loaded uniformly over a middle portion (2") of the inner surface. The mesh generation and the loading for half of the undeformed structure are shown in the figure. The tube material is assumed to be elastic-plastic strain hardening with the following properties: $E = 3 \times 10^7$ psi, $\nu = 0.3$, $\sigma_y = Y_1 = 1.5 \times 10^5$ psi, $\omega_1 = 0.05$, $Y_2 = 2.25 \times 10^5$ psi, $\omega_2 = 0.0$. This problem was solved in (ref. 8) for a non-hardening material based on a different formulation and a different loading approach. For the present problem, we have determined the elastic solution at the moment that the first element becomes plastic; the corresponding pressure is found to be $0.5676\,\sigma_y$. We then use the scaled incremental loading approach until one of the outside elements yields. Ten additional cycles were needed and the sequence in which the elements become plastic is 1,5,9,2,13,6,10,3,7,14,11,17,4. Some cycles cause more than one element to become plastic. This is because those elements with effective stress $\bar\sigma \ge 0.99\,\sigma_y$ have been considered as plastic. This allowance is advantageous for saving computing time. It takes one minute of CPU time per cycle for the present problem on our computer (IBM 360 Model 44). The numerical results for the radial displacement at the inside, $U_a$ (point 1), and outside, $U_b$ (point 5), as functions of internal pressure are shown in Figure 9. The radial displacements along the bore surface for various stages of loading are shown in Figure 10. Finally the stresses at the centroid of
one inside element (No. 1) are shown in Figure 11. The effect of loading history on the four stress components can be seen from the figure.

FIGURE 4. INFLUENCE OF NUMBER OF ELEMENTS AND LOADING STEPS

FIGURE 5. RADIAL STRESS IN AN ELASTIC-PERFECTLY-PLASTIC TUBE

FIGURE 7.

PERTURBATION METHODS FOR THE SOLUTION OF LINEAR PROBLEMS

Abstract. Linear problems of central interest in numerical analysis are the solution of linear equations, the construction of the inverse or a generalized inverse of a linear operator, finding the eigenvalues and eigenvectors of a linear operator, and linear programming. A survey is made of methods which apply if the data of a solved linear problem is perturbed by operators and vectors of small norm (analytic perturbation), or by operators of finite rank and vectors belonging to a finite-dimensional subspace (algebraic perturbation). Perturbation methods may be used to extend the theory of linear problems, to estimate errors due to inaccurate data and computation, and to solve perturbed problems with economy of effort.

1. Linear problems. In the abstract framework of functional analysis, a linear problem is one which can be formulated in terms of linear spaces and operators [38, Chapter I]. Naturally, many problems of theoretical and practical interest in numerical analysis belong to this general class. Among these problems, some are important enough to be the subjects of extensive investigations, and also appear in the daily workload of most computing centers devoted to general scientific computation. Of these significant problems, the ones singled out for discussion here are: (a) solution of linear equations, (b) inversion of linear operators, (c) finding the eigenvalues and eigenvectors of a linear operator, and (d) linear programming. These problems will now be defined in appropriate generality.

a. Solution of linear equations.

Let $X,Y$ denote complete normed linear spaces over a common scalar field $\Lambda$. In most applications, one has $\Lambda = R$, the real numbers, or $\Lambda = C$, the complex numbers. The notation $L(X,Y)$ will be used for the set of continuous linear operators from $X$ into $Y$. Given an operator $A \in L(X,Y)$ and a vector $y \in Y$ as data, the problem is to find a solution $x \in X$ of the linear equation

(1.1) $Ax = y$.

For practical as well as abstract treatment of this problem, it is important to be in possession of a theory of equation (1.1), which provides information as to which of the following alternatives holds:

(i) For each $y \in Y$, equation (1.1) has a unique solution $x \in X$;

(ii) for some $y \in Y$, equation (1.1) has no solution or several solutions.

The choice between (i) existence and uniqueness or (ii) nonexistence or nonuniqueness of solutions exhausts the logical possibilities, and thus the alternative structure (i)-(ii) will be characteristic of the theory of any equation, linear or nonlinear. In case (i), the operator $A$ is said to be nonsingular; otherwise (case (ii)), it is called a singular operator.

b. Inversion of linear operators.

This problem is closely related to the solution of linear equations. In the nonsingular case (i), equation (1.1) defines the linear (right) inverse operator $A^{-1}$ which gives the unique solution $x$ as

(1.2) $x = A^{-1}y$.

In many applications, one has $Y = X$ and $A^{-1}$ satisfies

(1.3) $A^{-1}A = AA^{-1} = I$,

where $I$ denotes the identity operator in $X$, that is, $Ix = x$ for all $x \in X$.

In the singular case (ii), the alternatives are nonexistence or nonuniqueness of solutions $x$ of (1.1). The inverse operator $A^{-1}$ of $A$ does not exist in this case, but one may seek a generalized inverse $A^+$ of $A$ which has some properties which are desirable for the application at hand. For example, in connection with the problem of solving the linear equation (1.1), one might want

(1.4) $x = A^+ y$

to be a solution if the equation is consistent, and thus is satisfied by one or more elements of $X$. It turns out that this is equivalent to the condition that $A^+$ satisfies the operator equation

(1) $AA^+A = A$.

Any operator $A^+$ for which (1) holds will be called an inner inverse of $A$ [25]. The more formal term {1}-inverse of $A$ [3, pp. 7-8] has also been applied originally to operators $A^+$ satisfying condition (1).

If equation (1.1) is consistent and $A^+$ is an inner inverse of $A$, then all solutions $x$ may be represented in the form

(1.5) $x = A^+ y + (I - A^+A)z$

for $z \in X$. With $z$ arbitrary, formula (1.5) is called the general solution of equation (1.1), as in the elementary theory of linear differential equations.

Generalized inverses may also be useful in case equation (1.1) has no solution, and thus is said to be inconsistent, or overdetermined. Here, the possibility of choosing a generalized solution $x$ to minimize the norm of the residual vector

(1.6) $r = Ax - y$

in $Y$ is considered. Suppose that the set

(1.7) $G(A,y) = \{ x \mid \|Ax - y\| = \min_{z \in X} \|Az - y\| \}$

of generalized solutions $x$ is nonempty, as will certainly be the case if equation (1.1) is consistent. If $Y$ is a Hilbert space, then elements $x \in G(A,y)$ are ordinarily called least-squares solutions of the linear equation (1.1). Thus, one might require that $A^+ y \in G(A,y)$ for all $y \in Y$ in addition to property (1). Furthermore, the subset

(1.8) $S(A,y) = \{ x \mid x \in G(A,y),\ \|x\| = \min_{z \in G(A,y)} \|z\| \}$

of $G(A,y)$ may be nonempty, and would then consist of the generalized (or least squares) solutions $x$ of (1.1) of minimum norm. The requirement that $A^+ y \in S(A,y)$ for all $y \in Y$ would then also be a possible additional restriction on the set of generalized inverses of $A$. If $S(A,y)$ consists of a single point for each $y \in Y$, then the corresponding generalized inverse $A^+$ is uniquely determined. In case $X$ and $Y$ are finite-dimensional Euclidean spaces, this generalized inverse $A^+$ exists and is the Moore-Penrose inverse of $A$ [3, pp. 7, 103-121], which, in addition to (1), satisfies the condition

(2) $A^+AA^+ = A^+$;

that is, $A^+$ is also an outer inverse of $A$ [25, pp. 12-14], and the symmetry conditions

(3) $(AA^+)^* = AA^+$,

and

(4) $(A^+A)^* = A^+A$,

where $M^*$ denotes the conjugate transpose of the matrix $M$.

The problem of finding generalized solutions can become delicate in more general spaces, as the set $S(A,y)$ may consist of more than one element or be empty [20]; in fact, $G(A,y)$ will be empty if the infimum of the norm of the residual vector is not attained. Of course, there are also many applications of generalized inverses in addition to the solution of linear equations in the singular case [3,21], and this fairly recent subject already has a vast literature [24].
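
The distinction between an inner inverse and the Moore-Penrose inverse, and the general solution formula (1.5), can be checked on a small real matrix. The Python sketch below is an illustration only: the 2 x 2 matrix, the two candidate generalized inverses, and the helper names are assumptions chosen for the example, and for real matrices the conjugate transpose reduces to the transpose.

```python
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def close(P, Q, tol=1e-12):
    return all(abs(p - q) <= tol for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

def penrose_conditions(A, G):
    """Truth values of conditions (1)-(4) for a real matrix A and a candidate G."""
    AG, GA = matmul(A, G), matmul(G, A)
    return (close(matmul(AG, A), A),      # (1) A G A = A   (inner inverse)
            close(matmul(GA, G), G),      # (2) G A G = G   (outer inverse)
            close(transpose(AG), AG),     # (3) A G symmetric
            close(transpose(GA), GA))     # (4) G A symmetric

def general_solution(A, G, y, z):
    """x = G y + (I - G A) z, formula (1.5), for an inner inverse G of A."""
    Gy, GAz = matvec(G, y), matvec(matmul(G, A), z)
    return [Gy[i] + z[i] - GAz[i] for i in range(len(z))]

if __name__ == "__main__":
    A = [[1.0, 1.0], [1.0, 1.0]]                 # singular matrix
    inner = [[1.0, 0.0], [0.0, 0.0]]             # an inner inverse of A
    moore_penrose = [[0.25, 0.25], [0.25, 0.25]]
    print(penrose_conditions(A, inner))
    print(penrose_conditions(A, moore_penrose))
    print(general_solution(A, inner, [2.0, 2.0], [5.0, -3.0]))
```

Here the first candidate satisfies (1) and (2) but neither symmetry condition, while the second satisfies all four; with the consistent right-hand side $y = (2,2)$, every choice of $z$ in (1.5) gives a solution of $Ax = y$.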
c. The eigenvalue-eigenvector problem.

This problem is posed most naturally in the case $Y = X$ is a Hilbert space with inner product $\langle\ ,\ \rangle$. One looks for scalars (real or complex numbers) $\lambda$ and vectors $x \ne 0$ such that

(1.9) $Ax = \lambda x$.

Solutions $\lambda$ of this problem are called eigenvalues of the linear operator $A$; for each eigenvalue $\lambda$, nonzero solutions $x$ of (1.9) are said to be the corresponding eigenvectors of $A$. As equation (1.9) is homogeneous in $x$, the condition $x \ne 0$ may be replaced, for example, by

(1.10) $\langle x,x \rangle = 1$,

or some other normalization condition.

From a standpoint of functional analysis, the determination of the eigenvalues of $A$ is a special case of the more general problem of finding the spectrum $\sigma(A)$ of $A$. In a complex Hilbert space $X$, the set

(1.11) $\rho(A) = \{ \lambda \mid (A - \lambda I)^{-1} \in L(X,X) \}$

of complex numbers $\lambda$ is called the resolvent of $A$. Thus, $\lambda \in \rho(A)$ if and only if the operator $A - \lambda I$ has a continuous inverse. The spectrum of $A$ is simply the complement of the resolvent,

(1.12) $\sigma(A) = C - \rho(A)$,

and hence contains any eigenvalues of $A$.
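
In finite dimensions these definitions can be made concrete. The Python sketch below treats only the real symmetric 2 x 2 case in closed form, which is an assumption made to keep the example short; the function names are hypothetical. It returns eigenpairs normalized as in (1.10) and tests membership in the resolvent set through the determinant of $A - \lambda I$.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenpairs (lam, x) of the real symmetric matrix [[a, b], [b, d]],
    with each eigenvector normalized so that <x, x> = 1."""
    if b == 0.0:
        return [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    pairs = []
    for lam in (mean - radius, mean + radius):
        x1, x2 = b, lam - a                  # satisfies the first row of (A - lam I) x = 0
        norm = math.hypot(x1, x2)
        pairs.append((lam, (x1 / norm, x2 / norm)))
    return pairs

def in_resolvent(a, b, d, lam, tol=1e-12):
    """For a matrix, lam lies in the resolvent set iff det(A - lam I) is nonzero."""
    return abs((a - lam) * (d - lam) - b * b) > tol

if __name__ == "__main__":
    for lam, x in eig_sym_2x2(2.0, 1.0, 2.0):
        print(lam, x, in_resolvent(2.0, 1.0, 2.0, lam))
```

For a matrix the spectrum consists of eigenvalues only, so `in_resolvent` is false exactly at the values returned by `eig_sym_2x2`; in infinite-dimensional spaces the spectrum may contain further points.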

d. Linear programming.

In order to formulate this problem, suppose that $X,Y$ are real spaces with partial ordering relationships denoted by $\le$. For most applications, $X$ and $Y$ are taken to be finite-dimensional, in which case the partial ordering is the usual componentwise comparison of vectors [26, pp. 155-158]. Also needed is the dual space $X^* = L(X,R)$ of $X$; that is, the space of continuous linear functionals defined on $X$. It is convenient to use the bracket notation of Dirac [7, pp. 18-28] for linear functionals. If $c \in X^*$, then define

(1.13) $\langle c,x \rangle := c(x)$,

which will be consistent with the notation for the inner product if $X$ is a Hilbert space [7, pp. 6-8].

One formulation of the (primal) linear programming problem [26, pp. 156-157] is, given $A \in L(X,Y)$, $y \in Y$, $c \in X^*$, find $x \in X$ to maximize

(1.14) $f(x) = \langle c,x \rangle + \xi$

subject to

(1.15) $Ax \le y$, $x \ge 0$.

The function $f(x)$ defined by (1.14) is called the objective function of the problem, and conditions (1.15) are known as constraints.

Instead of the primal problem (1.14)-(1.15), one may wish to consider the dual problem [26, pp. 188-190] which is to find $z \in Y^*$ to minimize

(1.16) $g(z) = \langle z,y \rangle - \zeta$

subject to

(1.17) $A^*z \ge c$, $z \ge 0$.

* * * 

In  (1.17),  the  operator  A e L(Y  ,X  ) is  the  adjoint  of  A , defined  by 


(1.18) 


< A y ,x>  = < y , Ax  > 


for  all  y e Y , x e X . It  will  also  be  convenient  to  write 

* * * * * 

(1.19)  y A :=  A y , y £ Y ; 

* 

that  is,  y A is  the  linear  functional  on  X defined  by 

* it  * 

(1.20)  (y  A) x :=  y (Ax)  = < y ,Ax>  , x e X . 

This  is  analogous  to  the  notation  frequently  used  in  elementary  matrix  algebra,  with 

* 

x being  considered  to  be  a column  vector,  and  y a row  vector.  The  scalar  quan- 
tity (1.20)  will  also  be  denoted  by 

* * 

(1.21)  < y Ax  > :=  <y,Ax>. 
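In the finite-dimensional real case, the identity (1.18) and the row-vector convention (1.19)-(1.21) can be checked directly. The following is a minimal sketch, assuming $X = R^3$ and $Y = R^2$ with vectors and functionals stored as plain lists; the helper names `dot`, `matvec`, and `vecmat` are illustrative and do not come from the text.

```python
def dot(a, b):
    """Bracket <a, b> for a functional a and a vector b, both stored as lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def matvec(A, x):
    """Column-vector action Ax."""
    return [dot(row, x) for row in A]

def vecmat(y_star, A):
    """Row-vector action y*A, i.e. the adjoint A* applied to y* as in (1.19)."""
    return [sum(y_star[i] * A[i][j] for i in range(len(A)))
            for j in range(len(A[0]))]

A = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]        # A maps R^3 into R^2
x = [1.0, 1.0, 2.0]
y_star = [2.0, -1.0]          # a functional on Y = R^2

lhs = dot(vecmat(y_star, A), x)   # <A* y*, x>
rhs = dot(y_star, matvec(A, x))   # <y*, Ax>
print(lhs, rhs)
```

Both brackets evaluate the same scalar, which is the quantity written $\langle y^* A x \rangle$ in (1.21).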

The  subject  of  perturbation  methods  and  theory  has  a long  history,  and  there  is 
a vast  literature  devoted  to  this  topic  and  its  applications.  The  bibliography  at 
the end of this paper, rather than attempting to be comprehensive, lists only
references cited in the text, doubtless at the cost of omitting a number of significant
contributions. 

2.  Perturbed linear problems.  Perturbation theory, as applied to the linear
problems listed in §1, starts from the assumption that their solutions are known for the
given reference data $A \in L(X,Y)$, $y \in Y$, $c \in X^*$. The object is to study the
behavior of these solutions for various classes of perturbed data

(2.1)  $B = A + \Delta A$, $z = y + \Delta y$, $d = c + \Delta c$,

where the perturbations $\Delta A \in L(X,Y)$, $\Delta y \in Y$, $\Delta c \in X^*$, or appropriate information
about them, are given. One then desires to calculate or estimate the corresponding
changes $\Delta x \in X$, $\Delta A^{-1} \in L(Y,X)$, $\Delta A^{+} \in L(Y,X)$, $\Delta\lambda \in \Lambda$ in the solutions $x \in X$,
$A^{-1} \in L(Y,X)$, $A^{+} \in L(Y,X)$, $\lambda \in \Lambda$ of the original problems. Here

(2.2)  $\Delta A^{-1} = B^{-1} - A^{-1}$

denotes the difference between the inverse, if it exists, of the perturbed operator
$B$ and the inverse of the unperturbed operator $A$, and not $(\Delta A)^{-1}$, which may also
exist. A similar observation applies to the notation $\Delta A^{+}$.
As an example, the perturbed linear system

(2.3)  $Bw = z$

can be solved for

(2.4)  $w = x + \Delta x$

if $\Delta x$ can be obtained in terms of $\Delta A$ and $\Delta y$, the solution $x$ of the unperturbed
system (1.1) with reference data $A$, $y$ being assumed to be known.

The  motivation  behind  perturbation  methods  is  that  if  the  perturbations  in  the 
data  are  "small"  in  some  sense,  then  one  might  expect  the  changes  in  the  solutions  to 
be  correspondingly  small,  at  least  under  suitable  conditions.  What  is  referred  to 
here as "small" may vary widely, depending on the specific problem, the type of
perturbation considered, the computing power available, and perhaps other factors. In
the  next  section,  a framework  will  be  developed  to  characterize  the  concept  of  small 
perturbations  more  precisely. 

The  goals  of  perturbation  theory  may  be  either  practical  or  theoretical.  Two 
uses of perturbation methods in actual computation are to find solutions of perturbed
problems  with  economy  of  effort,  and  to  obtain  error  estimates.  In  the  first  case, 
computing  the  solution  of  a given  linear  problem  might  be  extremely  laborious,  but  a 
large  amount  of  information  could  be  generated  in  the  process.  One  would  then  hope 
to  be  able  to  use  this  information  to  solve  perturbations  of  the  reference  problem 
with  less  work  than  required  when  starting  from  scratch,  as  indicated  in  connection 
with  the  illustration  (2. 3) -(2. 4)  cited  above.  In  the  case  of  error  estimation,  the 
perturbations are considered to arise from inaccuracies in the data and from truncation
and roundoff errors in the computation. Usually, these perturbations can only
be  estimated,  and  one  seeks  some  kind  of  information  about  the  possible  error  in  the 
solution. One approach, called forward error estimation, starts from assumptions
about the perturbations in the data, and obtains a comparison of the solution actually
obtained with that of the reference problem if exact data and computation were
employed. For backward error estimation, as developed by Wilkinson [34], the solution
actually obtained is taken to be the exact solution of some perturbation of the
reference problem, and estimates are made of the corresponding changes in the data.
With  the  forward  method,  the  computed  solution  is  considered  to  be  acceptable  if  it 
can  be  shown  to  be  "close"  to  the  (unknown)  solution  of  the  reference  problem,  while 
in the backward procedure, the criterion of acceptability is that the problem actually
solved is "close" to the reference problem in some sense. More precise concepts
of  "closeness"  will  be  introduced  in  the  next  section. 

Perturbation  methods  can  also  be  used  for  theoretical  purposes.  If  a conceptual 
framework can be developed in which the problems considered can be viewed as perturbations
of problems with known theory, then it may be possible to extend this theory
from one class to the other. This is the basis, for example, of the classical technique
of Erhard Schmidt for obtaining the theory of linear Fredholm integral equations
of second kind from the theory of finite linear algebraic systems [4, p. 155].

A more  general  situation  will  be  described  in  a later  section.  Another  theoretical 
use  of  perturbation  methods,  closely  related  to  error  estimation,  is  to  determine 
the  sensitivity  of  the  solution  of  a linear  problem  to  changes  in  the  data.  For 
example,  one  may  wish  to  know  which  components  of  the  solution  are  affected  most 
strongly  by  a small  change  in  one  of  the  coefficients  of  the  input  data,  and  which 
are relatively undisturbed. This kind of analysis can also be used to pursue
cause-and-effect relationships in mathematical models of various natural systems and
processes.

3.  Analytic and algebraic perturbations.  For the present purposes, it will be
convenient to classify perturbations into two nonexclusive categories: analytic and
algebraic.  This  classification  arises  from  the  information  available  in  each  case 
and  the  methodology  used  to  solve  the  perturbation  problem,  as  well  as  an  attempt  to 
clarify  what  is  meant  by  a "small"  perturbation.  In  general,  analytic  perturbation 
theory  uses  metric  information,  and  obtains  solutions  to  perturbation  problems  in 
terms  of  series  expansions,  or  by  iterative  methods.  An  objective  criterion  for  a 
perturbation to be small in this case is that the required series or iterations
converge. A more subjective condition is that the convergence be rapid enough to be
useful  in  practice.  The  satisfaction  of  this  restriction  will  depend,  among  other 
things,  on  the  computing  power  available  and  whether  the  transformations  involved 
can  be  carried  out  explicitly,  or  have  to  be  approximated. 

The  idea  of  smallness  for  algebraic  perturbations  also  depends  more  or  less  on 
outside factors. Here, the perturbations of operators are operators with
finite-dimensional range, and vectors and functionals are perturbed by elements belonging
to finite-dimensional subspaces of the corresponding spaces. The solution of algebraic
perturbation problems will require solving finite algebraic problems of similar
type, with the judgment as to what constitutes a "small" finite algebraic problem
being again tied up with the resources available for computing. For example, early
workers in the theory of linear integral equations knew that replacing them by a
corresponding finite linear algebraic system would yield good approximate solutions,
but despaired of being able to solve systems of order 10 or 20, as might be required
to attain the desired accuracy [18, p. 242]. By contrast, today most computing
centers are able to furnish the solutions of well-conditioned linear algebraic systems
of order 100 or 200 at nominal cost.
More precise formulations will now be made of the type of information given and
expected with each type of perturbation.

a.  Analytic  perturbations. 

The fundamental metric information about vectors in Banach spaces $X, Y, \ldots$ is
given, of course, by the respective norms $\|\cdot\|_X, \|\cdot\|_Y, \ldots$. As confusion is
unlikely, the subscripts will usually be dropped. If $A$ is a continuous linear
operator from $X$ into $Y$, that is, if $A \in L(X,Y)$, then the numbers

(3.1)  $M(A) = \sup_{\|x\|=1} \|Ax\|$, $m(A) = \inf_{\|x\|=1} \|Ax\|$,

exist and are finite [2, p. 54; 16, p. 194]. $M(A)$ and $m(A)$ are called the upper
and lower bound of $A$, respectively. With the natural definitions of addition and
scalar multiplication of linear operators, it is well known [38, p. 163] that $L(X,Y)$
is a Banach space for the operator norm $\|A\| = M(A)$. In some spaces, this norm is
easy to compute, but in others, finding $M(A)$ might require more effort than solving
the problem of interest. For numerical purposes, it is often convenient to assign a
norm to the linear operator space $L(X,Y)$ which is easier to compute than the
operator norm, and is consistent with it in the sense that

(3.2)  $\|A\| \ge M(A)$.

For example, if $X = Y = E^n$, (complex) $n$-dimensional Euclidean space, then $A = (a_{ij})$
is represented by an $n \times n$ matrix with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$. One has

(3.3)  $M(A) = \max\{|\lambda_1|, |\lambda_2|, \ldots, |\lambda_n|\}$,

which requires finding the eigenvalue of largest modulus of $A$. On the other hand,
the Euclidean norm of $A$,

(3.4)  $\|A\| = \left( \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|^2 \right)^{1/2}$,

is consistent and may be found by a straightforward calculation.
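The consistency condition (3.2) for the Euclidean norm (3.4) can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it uses a real symmetric $2 \times 2$ matrix (chosen so that the largest eigenvalue modulus in (3.3) coincides with $M(A)$), and it estimates $M(A)$ and $m(A)$ from (3.1) by sampling unit vectors on the circle rather than by an eigenvalue computation. All function names are hypothetical.

```python
import math

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def norm2(x):
    return math.sqrt(sum(xi * xi for xi in x))

def euclidean_norm(A):
    """Euclidean matrix norm as in (3.4)."""
    return math.sqrt(sum(abs(a) ** 2 for row in A for a in row))

A = [[2.0, 1.0],
     [1.0, 3.0]]              # symmetric example matrix (an assumption of this sketch)

# Sample unit vectors to estimate the upper and lower bounds M(A), m(A) of (3.1).
n_samples = 3600
samples = [[math.cos(2 * math.pi * k / n_samples),
            math.sin(2 * math.pi * k / n_samples)] for k in range(n_samples)]
stretch = [norm2(matvec(A, x)) for x in samples]
M_est, m_est = max(stretch), min(stretch)

print(M_est, m_est, euclidean_norm(A))
```

For this matrix the eigenvalues are $(5 \pm \sqrt{5})/2$, and the sampled estimate of $M(A)$ stays below the Euclidean norm $\sqrt{15}$, as (3.2) requires.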

In the case of the adjoint spaces $X^*, Y^*, \ldots$ of continuous linear functionals
on $X, Y, \ldots$, the norm will always be defined analogously to the operator norm, that
is,

(3.5)  $\|c\| = \sup_{\|x\|=1} |\langle c, x \rangle|$

for $c \in X^*$.

Thus, in the perturbed linear system (2.3), one would want a convergent process
to calculate $\Delta x$, or an estimate for $\|\Delta x\|$ in terms of bounds for $\|\Delta A\|$ and
$\|\Delta y\|$, and perhaps also the known quantities $\|A\|$, $\|x\|$, $\|y\|$.

b.  Algebraic  perturbations. 

An algebraic perturbation $\Delta y$ of a vector $y \in Y$ is defined to be an element
of a finite-dimensional subspace

(3.6)  $Y_n = \mathrm{span}\{y_1, y_2, \ldots, y_n\}$

of $Y$ consisting of all linear combinations of given independent basis vectors
$y_1, y_2, \ldots, y_n$ in $Y$. A similar definition applies to algebraic perturbations of
linear functionals. Ordinarily, algebraic perturbations will be restricted to
subspaces with small dimension (in the sense described above). However, if the original
spaces are finite-dimensional, it is of course possible to represent an arbitrary
perturbation as an algebraic perturbation.

In the case of linear operators, algebraic perturbations are represented by
linear operators with finite-dimensional ranges. Such operators are said to be of
finite rank, or degenerate (in infinite-dimensional spaces). Here, the dyadic
notation of Dirac [7, pp. 26-28] will be adopted; for $u \in Y$, $v \in X^*$, the symbol
$u \rangle\langle v$ will represent an operator of rank one from $X$ into $Y$, with

(3.7)  $(u \rangle\langle v)x = u \langle v, x \rangle = \langle v, x \rangle u \in Y$

for $x \in X$. Also, for $y^* \in Y^*$, the transposed operation will be denoted by

(3.8)  $y^*(u \rangle\langle v) = \langle y^*, u \rangle v \in X^*$,

again consistent with the notation introduced in §1. In these terms, a general
algebraic perturbation $\Delta A \in L(X,Y)$ of rank $n$ will be written as

(3.9)  $\Delta A = \sum_{i=1}^{n} u_i \rangle\langle v_i$,

where the vectors $u_i \in Y$ and functionals $v_i \in X^*$, $i = 1,2,\ldots,n$, form linearly
independent sets. The range of the operator (3.9) is $Y_n = \mathrm{span}\{u_1, u_2, \ldots, u_n\}$. In
the finite-dimensional case, $Y_n$ could coincide with $Y$, and arbitrary
perturbations of linear operators could be written in the form (3.9).
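The action of the dyads in (3.7) and of the finite rank operator (3.9) can be sketched as follows. This is a minimal illustration, assuming $X = R^2$ and $Y = R^3$ with functionals stored as coefficient lists; the function names are hypothetical. Note that the operator is applied term by term, so no matrix for $\Delta A$ is ever formed.

```python
def dot(a, b):
    """Bracket <a, b>."""
    return sum(ai * bi for ai, bi in zip(a, b))

def dyad_apply(u, v, x):
    """(u >< v) x = <v, x> u, as in (3.7)."""
    s = dot(v, x)
    return [s * ui for ui in u]

def finite_rank_apply(us, vs, x):
    """Apply sum_i u_i >< v_i to x, as in (3.9), one dyad at a time."""
    out = [0.0] * len(us[0])
    for u, v in zip(us, vs):
        term = dyad_apply(u, v, x)
        out = [o + t for o, t in zip(out, term)]
    return out

us = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]   # u_1, u_2 in Y = R^3
vs = [[1.0, 1.0], [1.0, -1.0]]            # v_1, v_2 as functionals on X = R^2
x = [3.0, 1.0]
dx = finite_rank_apply(us, vs, x)
print(dx)
```

The result always lies in $\mathrm{span}\{u_1, u_2\}$, which is the range $Y_n$ described above.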

Algebraic perturbations of vectors and linear operators are sometimes referred
to as finite rank modifications. This terminology is useful if a clear distinction
between analytic and algebraic methods is intended. By the use of algebraic
perturbation theory, one would expect to obtain the perturbations in solutions of linear
problems in the same form as the perturbations in the data. For example, one would
want to express $\Delta x$ as a linear combination of vectors $x_1, x_2, \ldots, x_n$ to be
determined, that is, $\Delta x \in \mathrm{span}\{x_1, x_2, \ldots, x_n\} = X_n$, a finite-dimensional subspace of $X$.
Similarly, expressions of the form (3.9) for $\Delta A^{-1}$ and $\Delta A^{+}$ would be sought. In
other words, algebraic perturbations in the data of linear problems are expected to
give rise to finite rank modifications of their solutions.

In  contrast  to  analytic  perturbation  theory,  the  use  of  algebraic  methods  does 
not  involve  restrictions  on  the  norms  of  the  perturbations  in  the  data.  However,  it 
is  possible  that  algebraic  perturbations  can  be  small  in  the  analytic  sense,  so  that 
either technique could be employed. Also, as illustrated in the next section,
certain problems lend themselves to a combination of algebraic and analytic methods.




4.  Compact operators and the Fredholm theory.  A theoretical application of
perturbation methods, which also has implications for numerical computation, is the
extension of the theory of finite linear algebraic systems of $n$ equations in $n$ unknowns
to certain types of linear equations (1.1) in infinite-dimensional spaces. An
extension of this kind will be obtained here by the use of both analytic and algebraic
techniques. First, the alternative structure (1.1a) of the theory of equation (1.1)
will be given an explicit formulation for the class of operators to be considered.

Definition 4.1.  Linear operators belonging to a class $\mathcal{A} \subset L(X,Y)$ are said to
have a Fredholm theory if for each $A \in \mathcal{A}$, either (i) the homogeneous equation

(4.1)  $Ax = 0$

has the unique solution $x = 0$, in which case the inhomogeneous equation (1.1) has a
unique solution $x$ for each $y \in Y$, or (ii) equation (4.1) has nonzero solutions,
each of which can be expressed as a linear combination of a finite number $d$ of linearly
independent solutions $x_1, x_2, \ldots, x_d \in X$, in which case the transposed homogeneous
equation

(4.2)  $zA = 0$

likewise has $d$ linearly independent solutions $z_1, z_2, \ldots, z_d \in Y^*$, in terms of
which all its nonzero solutions are expressible as linear combinations, and the
inhomogeneous equation (1.1) has no solutions unless

(4.3)  $\langle z_i, y \rangle = 0$, $i = 1,2,\ldots,d$.

If (4.3) is satisfied and $x_0$ is any solution of (1.1) (sometimes called a
particular solution), then the general solution of the inhomogeneous equation can be
written as

(4.4)  $x = x_0 + \sum_{i=1}^{d} \alpha_i x_i$,

with arbitrary scalars $\alpha_1, \alpha_2, \ldots, \alpha_d$.

For the algebraic case $X = Y = R^n$, real $n$-dimensional space, the class $\mathcal{A}$ of
linear operators with Fredholm theory consists of all $n \times n$ real matrices $A = (a_{ij})$,
that is, $\mathcal{A} = L(R^n, R^n)$, and the alternatives in Definition 4.1 were known to hold
long before 1903, when the Swedish mathematician Ivar Fredholm [6] established the
correspondence between the theories of finite linear algebraic systems and linear
integral equations of the form

(4.5)  $x(s) - \lambda \int_0^1 K(s,t)\,x(t)\,dt = y(s)$, $0 \le s \le 1$,

giving rise to the present name for the theory.

Definition 4.2.  A linear operator $K \in L(X,Y)$ is said to be compact if, given
any $\epsilon > 0$, there exists a positive integer $n = n(\epsilon)$ such that

(4.6)  $K = S + F$,

where $\|S\| < \epsilon$ and $F$ is of finite rank $n$.
A compact operator may thus be regarded as a small analytic perturbation of an
operator of finite rank, or as a finite rank modification of an operator which is
small in the analytic sense. It will be shown that the Fredholm theory can be
extended to operators which can be expressed as the sum of a linear operator having a
continuous inverse and a compact operator. That is, if $\mathcal{J} \subset L(X,Y)$ denotes the
class of linear operators $J$ such that $J^{-1} \in L(Y,X)$ exists, $\mathcal{K} \subset L(X,Y)$ the class
of compact operators, and $\mathcal{A} = \mathcal{J} \oplus \mathcal{K}$ the class of linear operators of the form

(4.7)  $A = J + K$, $J \in \mathcal{J}$, $K \in \mathcal{K}$,

then each $A \in \mathcal{A}$ has a Fredholm theory. This assertion will be proved in the next
section by combining results from both analytic and algebraic perturbation theory.
First, it will be shown that if $J \in \mathcal{J}$, then one has the well known result that
$J + \Delta J \in \mathcal{J}$ for $\|\Delta J\|$ sufficiently small. Later, the Fredholm alternative given
in Definition 4.1 will be established for operators which are the sum of invertible
linear operators and linear operators of finite rank. The statement that operators
of the form (4.7) have a Fredholm theory will then follow from Definition 4.2.

5.  Nonsingular linear equations and operators.  In this section, the problems of
solving linear systems and the inversion of linear operators will be considered for
the nonsingular case. Here, alternative (1.1a(i)) holds, and the inverse $A^{-1}$ of
the operator $A$ exists.

a.  Analytic  perturbation  of  well-posed  problems. 

Definition 5.1.  A problem is said to be well-posed if it has a unique
solution which depends continuously on the data.

As a general rule, analytic perturbation methods are only successful when
applied to well-posed problems. This can require the imposition of additional
conditions on the data to ensure uniqueness and continuous dependence of the solution, at
least in some neighborhood of the solution of the reference problem. For the linear
problems considered in this section to be well-posed, the continuity (and hence
boundedness) of $A^{-1}$ is required in addition to its existence. Consequently, it
will be assumed that $A^{-1} \in L(Y,X)$ in the following discussion of the application of
analytic perturbation theory. If $A$ maps $X$ onto $Y$, then it is well known that
$A^{-1} \in L(Y,X)$ if and only if $m(A) > 0$ [2, pp. 145-150]. Lonseth [16, p. 194] has
derived the relationship

(5.1)  $m(A)\,M(A^{-1}) = M(A)\,m(A^{-1}) = 1$

between the upper and lower bounds of a linear operator $A$ with the continuous
inverse $A^{-1}$. Furthermore, $(A + \Delta A)^{-1}$ exists if $M(\Delta A) < m(A)$. Using (5.1), this
result may be stated in terms of consistent norms.
Theorem 5.1.  If $\|\Delta A\| < \|A^{-1}\|^{-1}$, then $(A + \Delta A)^{-1}$ exists and is given by

(5.2)  $(A + \Delta A)^{-1} = \sum_{n=0}^{\infty} (-A^{-1}\Delta A)^n A^{-1}$.

Proof: The hypothesis guarantees the convergence of the Neumann series
on the right side of (5.2). Denoting this series by $S$, one finds by direct
manipulation that $(A + \Delta A)S = I_Y$, the identity operator in $Y$, and $S(A + \Delta A) = I_X$; hence,
$S = (A + \Delta A)^{-1}$. QED

Although the Neumann series expansion (5.2) is useful for theoretical purposes, it is
likely to be too slowly convergent for practical computation. The partial sums

(5.3)  $S_k = \sum_{n=0}^{k} (-A^{-1}\Delta A)^n A^{-1}$

of the Neumann series (5.2) may be obtained by the simple iteration

(5.4)  $S_0 = A^{-1}$, $S_k = S_0 - (A^{-1}\Delta A) S_{k-1}$, $k = 1,2,\ldots$.

From (5.2), for $\theta = \|A^{-1}\Delta A\|$,

(5.5)  $\|(A + \Delta A)^{-1} - S_k\| \le \frac{\theta^{k+1}}{1 - \theta} \|A^{-1}\|$.

In order to find a more efficient method, the Hotelling-Lonseth algorithm [17] may
be adapted to this purpose. In this special case, the iteration process is

(5.6)  $B_0 = A^{-1}$, $B_k = [I + (-A^{-1}\Delta A)^{2^{k-1}}] B_{k-1}$, $k = 1,2,\ldots$.

It is easy to show by mathematical induction that $B_k = S_{2^k - 1}$; hence, from (5.5),

(5.7)  $\|(A + \Delta A)^{-1} - B_k\| \le \frac{\theta^{2^k}}{1 - \theta} \|A^{-1}\|$,

so that the sequence $\{B_k\}$ defined by (5.6) converges quadratically to $(A + \Delta A)^{-1}$.
The only additional labor required over the more slowly convergent algorithm (5.4) is
the repeated squaring of the small operator $-A^{-1}\Delta A$.
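The relation between the simple iteration (5.4) and the repeated-squaring iteration (5.6) can be checked on a small example. The following sketch is offered only as an illustration: it assumes a $2 \times 2$ reference matrix $A = \mathrm{diag}(2, 4)$ with a hand-picked perturbation, and all helper names are hypothetical. Seven steps of (5.4) and three steps of (5.6) should both produce the partial sum $S_7$.

```python
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def add(P, Q):
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def scale(c, P):
    return [[c * p for p in row] for row in P]

def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

I2 = [[1.0, 0.0], [0.0, 1.0]]
A_inv = [[0.5, 0.0], [0.0, 0.25]]          # inverse of A = diag(2, 4)
dA = [[0.2, 0.4], [0.4, -0.2]]             # perturbation Delta A
T = scale(-1.0, matmul(A_inv, dA))         # T = -A^{-1} Delta A

# Exact inverse of B = A + dA = [[2.2, 0.4], [0.4, 3.8]] by the 2x2 formula.
det = 2.2 * 3.8 - 0.4 * 0.4
B_inv = [[3.8 / det, -0.4 / det], [-0.4 / det, 2.2 / det]]

# Simple iteration (5.4): S_k = S_0 + T S_{k-1}.
S = A_inv
for _ in range(7):
    S = add(A_inv, matmul(T, S))           # after the loop, S = S_7

# Iteration (5.6): B_k = (I + T^(2^(k-1))) B_{k-1}.
Bk, P = A_inv, T
for _ in range(3):
    Bk = matmul(add(I2, P), Bk)
    P = matmul(P, P)                       # repeated squaring of T
# after the loop, Bk = B_3, which should agree with S_7

print(max_abs_diff(S, Bk), max_abs_diff(Bk, B_inv))
```

The first printed number should be at rounding level, reflecting $B_3 = S_7$, while the second shows how close three squaring steps already come to the exact inverse.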

Attention will now be devoted to the estimation of the perturbations $\Delta A^{-1}$ and
$\Delta x$ in the inverse of the perturbed operator and the solution of the perturbed linear
equation (2.3), respectively [14, 15, 16]. It will be helpful to introduce the
notion of the condition number of a bounded linear operator. For $A \in L(X,Y)$, the
exact condition number $K(A)$ of $A$ is defined to be

(5.8)  $K(A) = \frac{M(A)}{m(A)}$,

and is a measure of the distortion of the image in $Y$ of the unit ball in $X$
transformed by the operator $A$. If $A$ has a continuous inverse, then
$K(A) = M(A)\,M(A^{-1})$ by (5.1). For computational purposes, it may be expedient to use
consistent norms for $L(X,Y)$ and $L(Y,X)$, and the approximate condition number

(5.9)  $k(A) = \|A\| \cdot \|A^{-1}\|$,

which is an upper bound for $K(A)$. The inequalities given below will be stated in
terms of consistent norms and approximate condition numbers, but remain valid if
these upper bounds are replaced by their exact values.

First, from (5.2),

(5.10)  $\Delta A^{-1} = (A + \Delta A)^{-1} - A^{-1} = \sum_{n=1}^{\infty} (-A^{-1}\Delta A)^n A^{-1}$,

and thus,

(5.11)  $\|\Delta A^{-1}\| \le \frac{\|A^{-1}\|^2 \, \|\Delta A\|}{1 - \|A^{-1}\| \cdot \|\Delta A\|}$.

Dividing (5.11) by $\|A^{-1}\|$ and multiplying and dividing $\|\Delta A\|$ on the right hand
side by $\|A\|$ gives

(5.12)  $\frac{\|\Delta A^{-1}\|}{\|A^{-1}\|} \le \frac{k(A) \frac{\|\Delta A\|}{\|A\|}}{1 - k(A) \frac{\|\Delta A\|}{\|A\|}}$,

which expresses the relative change in the inverse in terms of the relative
perturbation of the reference operator and its (approximate) condition number. A similar
expression will now be obtained for the perturbation $\Delta x$ in the solution of (2.3).


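Theorem 5.2.  If $\|\Delta A\| < \|A^{-1}\|^{-1}$, then the perturbed linear equation (2.3)
has a unique solution $w = x + \Delta x$ for each $z = y + \Delta y$, and

(5.13)  $\frac{\|\Delta x\|}{\|x\|} \le \frac{k(A)}{1 - k(A) \frac{\|\Delta A\|}{\|A\|}} \left( \frac{\|\Delta A\|}{\|A\|} + \frac{\|\Delta y\|}{\|y\|} \right)$,

provided, of course, that $y \ne 0$.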

Proof: By Theorem 5.1, the hypothesis guarantees that $B^{-1} = (A + \Delta A)^{-1} = A^{-1} + \Delta A^{-1}$
exists, which implies the unique solvability of (2.3) for each $z$. Writing (2.3) as

(5.14)  $(A + \Delta A)(x + \Delta x) = y + \Delta y$,

one obtains

(5.15)  $\Delta x = \Delta A^{-1} y + (A + \Delta A)^{-1} \Delta y$.

As $y = Ax$, from (5.10),

(5.16)  $\Delta A^{-1} y = \sum_{n=1}^{\infty} (-A^{-1}\Delta A)^n x$,

so that

(5.17)  $\|\Delta A^{-1} y\| \le \frac{\|A^{-1}\| \cdot \|\Delta A\|}{1 - \|A^{-1}\| \cdot \|\Delta A\|} \|x\| = \frac{k(A) \frac{\|\Delta A\|}{\|A\|}}{1 - k(A) \frac{\|\Delta A\|}{\|A\|}} \|x\|$.

Similarly, from (5.2) and the fact that $\|y\| \le \|A\| \cdot \|x\|$,

(5.18)  $\|(A + \Delta A)^{-1} \Delta y\| \le \frac{\|A^{-1}\| \cdot \|\Delta y\|}{1 - \|A^{-1}\| \cdot \|\Delta A\|} \cdot \frac{\|A\| \cdot \|x\|}{\|y\|} = \frac{k(A)}{1 - k(A) \frac{\|\Delta A\|}{\|A\|}} \cdot \frac{\|\Delta y\|}{\|y\|} \|x\|$.

Inequality (5.13) now follows directly from (5.15), (5.17), and (5.18). QED
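The bound (5.13) can be compared with the actual relative change in a small example. The sketch below is a minimal illustration under stated assumptions: it uses the maximum norm for vectors and the maximum absolute row sum for matrices (an operator norm, and therefore consistent), a $2 \times 2$ system solved by Cramer's rule, and hand-picked data. The helper names are hypothetical.

```python
def inf_norm_vec(x):
    return max(abs(xi) for xi in x)

def inf_norm_mat(P):
    """Maximum absolute row sum: the operator norm induced by the max norm."""
    return max(sum(abs(p) for p in row) for row in P)

def solve2(P, b):
    """Solve a 2x2 system by Cramer's rule."""
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    return [(b[0] * P[1][1] - P[0][1] * b[1]) / det,
            (P[0][0] * b[1] - P[1][0] * b[0]) / det]

A = [[2.0, 0.0], [0.0, 4.0]]
A_inv = [[0.5, 0.0], [0.0, 0.25]]
dA = [[0.2, 0.4], [0.4, -0.2]]
y, dy = [2.0, 4.0], [0.1, -0.2]

B = [[a + d for a, d in zip(ra, rd)] for ra, rd in zip(A, dA)]
z = [yi + di for yi, di in zip(y, dy)]

x = solve2(A, y)                 # reference solution
w = solve2(B, z)                 # perturbed solution
dx = [wi - xi for wi, xi in zip(w, x)]

k = inf_norm_mat(A) * inf_norm_mat(A_inv)       # approximate condition number (5.9)
rel_A = inf_norm_mat(dA) / inf_norm_mat(A)
rel_y = inf_norm_vec(dy) / inf_norm_vec(y)

lhs = inf_norm_vec(dx) / inf_norm_vec(x)        # actual relative change
rhs = k / (1.0 - k * rel_A) * (rel_A + rel_y)   # right side of (5.13)
print(lhs, rhs)
```

Here the hypothesis of Theorem 5.2 holds because $k(A)\,\|\Delta A\|/\|A\| < 1$, and the computed relative change lies well below the bound, which is typical since (5.13) is a worst-case estimate.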


b.  Algebraic perturbation of nonsingular linear equations and operators.

The simplest type of algebraic perturbation (2.3) of the linear system (1.1) is
with $\Delta A = 0$ and $\Delta y$ restricted to belong to a finite-dimensional subspace $Y_n$ of
$Y$. Given a basis $\{y_1, y_2, \ldots, y_n\}$ for $Y_n$, one need only find the corresponding
basis vectors

(5.19)  $x_i = A^{-1} y_i$, $i = 1,2,\ldots,n$,

of the subspace $X_n \subset X$ which will then contain all possible perturbations $\Delta x$.
Thus, given

(5.20)  $\Delta y = \alpha_1 y_1 + \alpha_2 y_2 + \ldots + \alpha_n y_n$,

it follows that

(5.21)  $\Delta x = \alpha_1 x_1 + \alpha_2 x_2 + \ldots + \alpha_n x_n$.

In actual computation, it may be more efficient to solve the $n$ systems $Ax_i = y_i$,
$i = 1,2,\ldots,n$, for the basis vectors for $X_n$, even if $X$ is finite-dimensional [5,
p. 77], than to calculate $A^{-1}$.
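The procedure (5.19)-(5.21) amounts to solving once for each basis vector and then forming linear combinations. A minimal sketch follows, assuming for simplicity a lower triangular $3 \times 3$ matrix so that each system $Ax_i = y_i$ can be solved by forward substitution; the matrix, the basis, and the function names are all illustrative choices, not taken from the text.

```python
def forward_solve(L, b):
    """Solve Lx = b for lower-triangular L with nonzero diagonal."""
    x = []
    for i, row in enumerate(L):
        s = sum(row[j] * x[j] for j in range(i))
        x.append((b[i] - s) / row[i])
    return x

A = [[2.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 3.0, 4.0]]
basis_y = [[2.0, 1.0, 0.0], [0.0, 1.0, 7.0]]   # y_1, y_2 spanning Y_2

# One solve per basis vector, done once: x_i = A^{-1} y_i as in (5.19).
basis_x = [forward_solve(A, yi) for yi in basis_y]

def delta_x(alphas):
    """Delta x from (5.21) for Delta y = sum_i alpha_i y_i; no further solves."""
    return [sum(a * xi[j] for a, xi in zip(alphas, basis_x)) for j in range(3)]

alphas = [0.5, -2.0]
dy = [sum(a * yi[j] for a, yi in zip(alphas, basis_y)) for j in range(3)]
print(delta_x(alphas), forward_solve(A, dy))
```

Both printed vectors agree: combining the precomputed $x_i$ gives the same $\Delta x$ as solving $A\,\Delta x = \Delta y$ directly, without ever forming $A^{-1}$.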

To introduce the study of the effect of a finite-rank modification of an
operator upon its inverse, the case of rank one perturbation will be considered first, as
all the indicated operations can be displayed explicitly. For $\Delta A = u \rangle\langle v$ with
$u \in Y$, $v \in X^*$ nonzero, the solvability of the perturbed system (2.3), that is

(5.22)  $(A + u \rangle\langle v)w = z$,

will be investigated for arbitrary $z$. As $A^{-1}$ is assumed to exist, the equations
$A\hat{u} = u$, $A\hat{z} = z$ can be solved uniquely for $\hat{u} = A^{-1}u$, $\hat{z} = A^{-1}z$, respectively. In
terms of these solutions, (5.22) may be written as

(5.23)  $w = \hat{z} - \hat{u} \langle v, w \rangle$.
The key to the solvability of (5.23), and hence of (5.22), is the determination of
the number $\xi = \langle v, w \rangle$. From (5.23),

(5.24)  $\langle v, w \rangle + \langle v, \hat{u} \rangle \langle v, w \rangle = \langle v, \hat{z} \rangle$.

If the determinant

(5.25)  $\delta = 1 + \langle v, \hat{u} \rangle = 1 + \langle v A^{-1} u \rangle$

does not vanish, then (5.24) has the unique solution

(5.26)  $\langle v, w \rangle = \frac{\langle v, \hat{z} \rangle}{\delta} = \frac{\langle v A^{-1} z \rangle}{1 + \langle v A^{-1} u \rangle}$,

where the notation (1.21) has been used in (5.25) and (5.26). Substitution of (5.26)
into (5.23) yields

(5.27)  $w = A^{-1}z - A^{-1}u \, \frac{\langle v A^{-1} z \rangle}{1 + \langle v A^{-1} u \rangle} = \left( A^{-1} - \frac{A^{-1}u \rangle\langle v A^{-1}}{1 + \langle v A^{-1} u \rangle} \right) z$,

so that

(5.28)  $(A + u \rangle\langle v)^{-1} = A^{-1} - \frac{A^{-1}u \rangle\langle v A^{-1}}{1 + \langle v A^{-1} u \rangle}$,

provided $\delta \ne 0$. Hence, the inverse of a rank one modification of an invertible
operator, if it exists, is a rank one modification of the inverse of the reference
operator. The symmetry of (5.28), sometimes called the Sherman-Morrison-Woodbury
formula [11, pp. 123-124; 35, 46], is appealing.
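The steps (5.23)-(5.27) translate directly into a short procedure for solving a rank one modified system when $A^{-1}$ (or a solver for $A$) is available. The following is a minimal sketch, assuming real finite-dimensional spaces and an explicitly stored $A^{-1}$; the function name `rank_one_update_solve` and the sample data are hypothetical.

```python
def dot(a, b):
    """Bracket <a, b>."""
    return sum(ai * bi for ai, bi in zip(a, b))

def matvec(P, x):
    return [dot(row, x) for row in P]

def rank_one_update_solve(A_inv, u, v, z):
    """Solve (A + u >< v) w = z via (5.27), given A^{-1}."""
    z_hat = matvec(A_inv, z)             # A^{-1} z
    u_hat = matvec(A_inv, u)             # A^{-1} u
    delta = 1.0 + dot(v, u_hat)          # determinant (5.25)
    if delta == 0.0:
        raise ZeroDivisionError("delta = 0: the perturbed operator is singular")
    xi = dot(v, z_hat) / delta           # <v, w> from (5.26)
    return [zh - xi * uh for zh, uh in zip(z_hat, u_hat)]

A_inv = [[0.5, 0.0], [0.0, 0.25]]        # A = diag(2, 4)
u, v, z = [1.0, 2.0], [1.0, 1.0], [3.0, 10.0]
w = rank_one_update_solve(A_inv, u, v, z)

# Residual check against B = A + u >< v = [[3, 1], [2, 6]].
B = [[3.0, 1.0], [2.0, 6.0]]
residual = [bi - zi for bi, zi in zip(matvec(B, w), z)]
print(w, residual)
```

Only two applications of $A^{-1}$ and two inner products are needed, which is the economy of effort that the rank one formula (5.28) is meant to provide.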

Using (5.28), the solution $w = x + \Delta x$ of (5.22) is

(5.29)  $x + \Delta x = x + A^{-1}\Delta y - \frac{\langle v, x + A^{-1}\Delta y \rangle}{1 + \langle v A^{-1} u \rangle} A^{-1}u$,

so that

(5.30)  $\Delta x = A^{-1}\Delta y - \frac{\langle v, x + A^{-1}\Delta y \rangle}{1 + \langle v A^{-1} u \rangle} A^{-1}u$.

Thus, the perturbation $\Delta x$ is a linear combination of $A^{-1}\Delta y$ and the vector
$\hat{u} = A^{-1}u$. If $\Delta y$ is an algebraic perturbation of the form (5.20), and $\hat{u}$ is
independent of the vectors $x_i = A^{-1}y_i$, $i = 1,2,\ldots,n$, then $\Delta x$ will lie in the
$(n+1)$-dimensional subspace $X_{n+1} = \mathrm{span}\{\hat{u}, x_1, x_2, \ldots, x_n\}$ of $X$; otherwise
$\Delta x \in X_n = \mathrm{span}\{x_1, x_2, \ldots, x_n\}$.

Before  going  to  the  general  case,  two  applications  of  algebraic  perturbation 
theory  will  be  given  which  involve  rank  one  modifications.  The  first  is  to  the 
Fredholm  integral  equation  (4.5)  in  which  the  kernel  K(s,t)  has  the  special  form 

(5.31)  $K(s,t) = \begin{cases} u(t)v(s), & 0 \le t \le s \le 1, \\ u(s)v(t), & 0 \le s \le t \le 1. \end{cases}$
This type of kernel arises in applications; for example, as a Green's function
determined by a two-point boundary value problem [32]. Given the representation (5.31)
for $K(s,t)$, the integral equation (4.5) may be written as

(5.32)  $x(s) - \lambda \int_0^s L(s,t)\,x(t)\,dt - \lambda \int_0^1 u(s)v(t)\,x(t)\,dt = y(s)$,

where

(5.33)  $L(s,t) = u(t)v(s) - u(s)v(t)$, $0 \le t \le s \le 1$.

Equation (5.32) is of the form $(A - \lambda u \rangle\langle v)x = y$, where $A = I - \lambda L$ is a linear
Volterra integral operator of second kind with kernel (5.33), and $u \rangle\langle v$ is a
Fredholm integral operator of first kind and rank one with kernel $u(s)v(t)$. The
inverse $A^{-1} = (I - \lambda L)^{-1}$ of the Volterra operator of second kind exists for all $\lambda$
[30, pp. 52-53], and thus the linear Volterra integral equation

(5.34)  $\hat{w}(s) - \lambda \int_0^s L(s,t)\,\hat{w}(t)\,dt = w(s)$

can be solved for arbitrary $w(s)$; in particular, one obtains $\hat{w}(s) = \hat{u}(s)$ for
$w(s) = u(s)$, and $\hat{w}(s) = \hat{y}(s)$ for $w(s) = y(s)$. Corresponding to (5.25), if the
Fredholm determinant

(5.35)  $\delta = 1 - \lambda \langle v, \hat{u} \rangle = 1 - \lambda \int_0^1 v(t)\hat{u}(t)\,dt$

does not vanish, then, from (5.27),

(5.36)  $x(s) = \hat{y}(s) + \frac{\lambda}{\delta}\,\hat{u}(s) \int_0^1 v(t)\hat{y}(t)\,dt$

is the unique solution of (5.32). Hence, the solution of the Fredholm integral
equation (4.5) with the kernel (5.31) can be obtained by solving the Volterra integral
equation (5.34) with right-hand sides $w(s) = u(s)$ and $w(s) = y(s)$, followed by the
calculation of the inner product integrals in (5.35) and (5.36).

The  second  application  to  be  considered  for  rank  one  modification  of  a linear 
operator  is  to  backward  error  analysis  in  the  solution  of  linear  equations.  Suppose 
that one attempts to solve the linear equation (1.1) and obtains, instead of $x$, an
approximate solution $w$ such that

(5.37)  $Aw = y + r$,

with nonzero residual $r$. The Hahn-Banach theorem [38, p. 186] guarantees the
existence of a linear functional $w^* \in X^*$ such that $\|w^*\| = 1$ and $\langle w^*, w \rangle = \|w\|$.
Thus, $w$ is the exact solution of the linear equation

(5.38)  $\left( A - \frac{r \rangle\langle w^*}{\|w\|} \right) w = y$

with perturbed operator and desired right-hand side. An analytic bound for the
perturbation of $A$ is thus

(5.39)  $\|\Delta A\| = \frac{\|r \rangle\langle w^*\|}{\|w\|} = \frac{\|r\|}{\|w\|}$.
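The construction (5.37)-(5.39) can be carried out explicitly in Euclidean space, where a functional with $\|w^*\| = 1$ and $\langle w^*, w \rangle = \|w\|$ is simply $w^* = w/\|w\|$. The sketch below assumes this Euclidean setting; the matrix, right-hand side, and approximate solution are hypothetical sample data.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def matvec(P, x):
    return [dot(row, x) for row in P]

A = [[2.0, 1.0], [1.0, 3.0]]
y = [3.0, 5.0]
w = [0.9, 1.3]                                # some approximate solution (hypothetical)

r = [ai - yi for ai, yi in zip(matvec(A, w), y)]   # residual in (5.37): Aw = y + r
norm_w = math.sqrt(dot(w, w))
w_star = [wi / norm_w for wi in w]            # Euclidean choice of the functional w*

# Perturbed operator of (5.38): A - (r >< w*) / ||w||.
B = [[A[i][j] - r[i] * w_star[j] / norm_w for j in range(2)] for i in range(2)]

backward_residual = [bi - yi for bi, yi in zip(matvec(B, w), y)]
size_dA = math.sqrt(dot(r, r)) / norm_w       # ||r|| / ||w|| as in (5.39)
print(backward_residual, size_dA)
```

The residual of the perturbed system vanishes to rounding error, so $w$ solves a nearby problem exactly, and `size_dA` measures how near that problem is to the reference one.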
Returning to the study of general algebraic perturbations, note that the
equivalence of (5.22) and the single scalar equation (5.24) establishes that (5.22) has a
Fredholm theory, because (5.24) does. In the case $\delta = 0$, the homogeneous equation
$(A + u \rangle\langle v)w = 0$ is satisfied by (and only by) vectors $w = \alpha\hat{u}$ with $\alpha$ arbitrary.
The inhomogeneous equations (5.22) and (5.24) then have solutions only if

(5.40)  $\langle v, \hat{z} \rangle = \langle v A^{-1} z \rangle = \langle \hat{v}, z \rangle = 0$,

where $\hat{v} = vA^{-1}$ satisfies the transposed homogeneous equation

(5.41)  $\hat{v}(A + u \rangle\langle v) = 0$.

Theorem 5.3.  If $\mathcal{J} \subset L(X,Y)$ denotes the class of all invertible linear
operators, and $\mathcal{F} \subset L(X,Y)$ the class of all linear operators of finite rank, then all
linear operators belonging to the class $\mathcal{A} = \mathcal{J} \oplus \mathcal{F}$ have a Fredholm theory.

Proof: If $B \in \mathcal{A} = \mathcal{J} \oplus \mathcal{F}$, then there is an invertible linear operator $A \in \mathcal{J}$
for which $B$ can be written as

(5.42)  $B = A + \sum_{j=1}^{n} u_j \rangle\langle v_j$,

where $u_j \in Y$, $v_j \in X^*$, $j = 1,2,\ldots,n$, are linearly independent sets of vectors
and functionals, respectively. Equation (2.3) in this case is equivalent to

(5.43)  $w = \hat{z} - \sum_{j=1}^{n} \hat{u}_j \langle v_j, w \rangle$,

where $\hat{z} = A^{-1}z$, $\hat{u}_j = A^{-1}u_j$, $j = 1,2,\ldots,n$. Applying the functionals $v_1, v_2, \ldots, v_n$
to (5.43) in turn gives the equivalent finite linear algebraic system of equations

(5.44)  $\xi_i + \sum_{j=1}^{n} a_{ij}\xi_j = \zeta_i$, $i = 1,2,\ldots,n$,

for $\xi_i = \langle v_i, w \rangle$, where $\zeta_i = \langle v_i, \hat{z} \rangle$ and $a_{ij} = \langle v_i, \hat{u}_j \rangle$, $i,j = 1,2,\ldots,n$.
As the Fredholm alternative applies to (5.44), it follows that operators of the form
(5.42) have a Fredholm theory. QED

In the nonsingular case, an expression can be obtained for $B^{-1}$ as a finite rank modification of $A^{-1}$. Let

(5.45)    $M = (\delta_{ij} + a_{ij})$

denote the matrix of coefficients of the linear system (5.44), where $\delta_{ij}$ is the Kronecker delta: $\delta_{ij} = 0$ if $i \ne j$, $\delta_{ii} = 1$. As the determinant $\delta$ of M is assumed to be nonzero, the inverse of M may be written

(5.46)    $M^{-1} = \dfrac{1}{\delta}\,(\beta_{ij}),$

and thus

(5.47)    $\langle v_i, w\rangle = \dfrac{1}{\delta} \sum_{j=1}^{n} \beta_{ij}\, \langle v_j, \tilde z\rangle, \quad i = 1,2,\ldots,n.$

Using (5.43) and the fact that $\langle v_i, \tilde z\rangle = \langle v_i A^{-1}, z\rangle = \langle \tilde v_i, z\rangle$ for $\tilde v_i = v_i A^{-1}$, i = 1,2,...,n, one obtains the solution w of (2.3) in this case as

(5.48)    $w = \Bigl(A^{-1} - \dfrac{1}{\delta} \sum_{i=1}^{n}\sum_{j=1}^{n} A^{-1}u_i \rangle\, \beta_{ij}\, \langle v_j A^{-1}\Bigr)\, z.$

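By taking appropriate linear combinations $\hat u_1, \hat u_2, \ldots, \hat u_n$ of $u_1, u_2, \ldots, u_n$ and $\hat v_1, \hat v_2, \ldots, \hat v_n$ of $v_1, v_2, \ldots, v_n$ (for example, by an LU-decomposition of M [5, pp. 27-32]), (5.48) may be put in the form

(5.49)    $w = \Bigl(A^{-1} - \dfrac{1}{\delta} \sum_{j=1}^{n} A^{-1}\hat u_j \rangle\langle \hat v_j A^{-1}\Bigr)\, z,$

from which

(5.50)    $\Bigl(A + \sum_{j=1}^{n} u_j \rangle\langle v_j\Bigr)^{-1} = A^{-1} - \dfrac{1}{\delta} \sum_{j=1}^{n} A^{-1}\hat u_j \rangle\langle \hat v_j A^{-1},$

which is analogous to (5.28).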

Another way to find the inverse of the perturbed operator (5.42) is the method of successive rank one modifications, which does not require obtaining $M^{-1}$ explicitly. Set

(5.51)    $B_0 = A, \quad B_0^{-1} = A^{-1},$

and then the algorithm

(5.52)    $B_k = B_{k-1} + u_k \rangle\langle v_k, \qquad B_k^{-1} = B_{k-1}^{-1} - \dfrac{B_{k-1}^{-1} u_k \rangle\langle v_k B_{k-1}^{-1}}{1 + \langle v_k B_{k-1}^{-1}, u_k\rangle},$

k = 1,2,...,n, will give $B^{-1} = B_n^{-1}$ if none of the intermediate determinants

(5.53)    $\delta_k = 1 + \langle v_k B_{k-1}^{-1}, u_k\rangle, \quad k = 1,2,\ldots,n,$

vanish.
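The method of successive rank one modifications can be sketched for matrices as follows. This is a minimal illustration, assuming finite-dimensional real data stored as lists of rows and exact rational arithmetic; the function names and the example matrices are hypothetical and are not taken from the paper.

```python
from fractions import Fraction as F

def rank_one_update(Binv, u, v):
    """One step of the recursion: inverse of (B + u v^T) from the inverse of B."""
    n = len(u)
    Bu = [sum(Binv[i][j] * u[j] for j in range(n)) for i in range(n)]   # B^{-1} u
    vB = [sum(v[i] * Binv[i][j] for i in range(n)) for j in range(n)]   # v B^{-1}
    delta = 1 + sum(v[i] * Bu[i] for i in range(n))                     # intermediate determinant
    if delta == 0:
        raise ZeroDivisionError("intermediate determinant vanishes")
    new_inv = [[Binv[i][j] - Bu[i] * vB[j] / delta for j in range(n)]
               for i in range(n)]
    return new_inv, delta

def successive_inverse(Ainv, us, vs):
    """Apply the rank one steps in order, starting from the inverse of A."""
    Binv, deltas = Ainv, []
    for u, v in zip(us, vs):
        Binv, delta = rank_one_update(Binv, u, v)
        deltas.append(delta)
    return Binv, deltas

# Hypothetical example: A = diag(2, 4, 5), perturbed by two rank one terms.
Ainv = [[F(1, 2), F(0), F(0)], [F(0), F(1, 4), F(0)], [F(0), F(0), F(1, 5)]]
us = [[F(1), F(0), F(1)], [F(0), F(1), F(0)]]
vs = [[F(1), F(1), F(0)], [F(0), F(1), F(1)]]
Binv, deltas = successive_inverse(Ainv, us, vs)
print(deltas)
```

In this example the perturbed matrix is B = [[3, 1, 0], [0, 5, 1], [1, 1, 5]], and the product of the intermediate determinants equals det(B)/det(A) = 73/40.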

It should be noted again that it is not necessary to obtain $A^{-1}$ to solve the equation (2.3) for

(5.54)    $w = \tilde z - \sum_{j=1}^{n} \xi_j\, \tilde u_j,$

as given by (5.43). What is required is to solve equation (1.1) for the n+1 right-hand sides $y = z, u_1, u_2, \ldots, u_n$ for $x = \tilde z, \tilde u_1, \tilde u_2, \ldots, \tilde u_n$, calculate the coefficients of the system (5.44) of n equations for the n unknowns $\xi_1, \xi_2, \ldots, \xi_n$, solve this system, and then form the linear combination (5.54).
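The procedure just described (n+1 solves with the reference operator, one small n-by-n system, and a final linear combination) might be sketched as below. The sketch assumes real matrices and exact rational arithmetic; `solve_A`, `solve_perturbed`, and the example data are hypothetical names and values chosen for illustration only.

```python
from fractions import Fraction as F

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(M, b):
    """Gauss-Jordan elimination in exact arithmetic (small dense systems only)."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(M, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [x / piv for x in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [x - f * yv for x, yv in zip(aug[r], aug[c])]
    return [aug[i][n] for i in range(n)]

def solve_perturbed(solve_A, us, vs, z):
    """Solve (A + sum_j u_j v_j^T) w = z using only solves with A."""
    z_t = solve_A(z)                                  # A z~ = z
    u_t = [solve_A(u) for u in us]                    # A u~_j = u_j
    n = len(us)
    M = [[(1 if i == j else 0) + dot(vs[i], u_t[j]) for j in range(n)]
         for i in range(n)]                           # coefficients delta_ij + a_ij
    zeta = [dot(v, z_t) for v in vs]                  # right-hand sides of the small system
    xi = solve(M, zeta)
    return [z_t[k] - sum(xi[j] * u_t[j][k] for j in range(n))
            for k in range(len(z_t))]                 # w = z~ - sum_j xi_j u~_j

# Hypothetical "easy" reference operator A = diag(2, 4, 5).
def solve_A(y):
    return [y[0] / 2, y[1] / 4, y[2] / 5]

us = [[F(1), F(0), F(1)], [F(0), F(1), F(0)]]
vs = [[F(1), F(1), F(0)], [F(0), F(1), F(1)]]
z = [F(1), F(2), F(3)]
w = solve_perturbed(solve_A, us, vs, z)
print(w)
```

The point of the design is that the reference solver is treated as a black box: the explicit inverse of A is never formed, and the only dense system solved has order n, the rank of the perturbation.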

An important application of the above technique of algebraic perturbation is to the numerical solution of partial differential equations by what is called the capacitance matrix method [36, 42]. The basic problem is to solve, for example, the Poisson or Helmholtz equation on a region $\Omega$, with information given on its boundary $\partial\Omega$ (see Figure 5.1). The use of finite-difference methods will lead to a linear algebraic system $Bw = z$ which may be very laborious to solve. On the other hand, rapid and effective methods may be available for the algebraic system $Ax = y$ corresponding to the same finite-difference approximation to the problem posed on an enclosing rectangle R with boundary $\partial R$. By regarding the algebraic system obtained for $\Omega$ as a finite rank perturbation of the easily solved system arising from the approximate problem on R, a considerable reduction in effort may be possible. Typically, if the order of the systems (1.1) and (2.3) is about $n^2$, then the rank of the perturbations $\Delta A$ and $\Delta y$ will be approximately n.

Figure 5.1

The  Fredholm  theory  will  now  be  shown  to  apply  to  operators  which  are  the  sum 
of  a continuously  invertible  operator  and  a compact  operator. 

Theorem 5.4. Operators A belonging to the class defined by (4.7) have a Fredholm theory.

Proof: Choose $\varepsilon < 1/\|J^{-1}\|$. According to Definition 4.2, the compact operator K may be written as

(5.55)    $K = S + \sum_{j=1}^{n} u_j \rangle\langle v_j,$

where $n = n(\varepsilon)$ is finite. Thus,

(5.56)    $A = J + S + \sum_{j=1}^{n} u_j \rangle\langle v_j,$

and Theorem 5.1 guarantees the existence of the inverse operator $(J+S)^{-1} \in L(Y,X)$. It follows from Theorem 5.3 that A has a Fredholm theory. QED

Theorem 5.4 provides a basis for the "kernel splitting" method due to Erhard Schmidt [4, p. 155] for proving the Fredholm Alternative Theorem [6] for the linear integral equation (4.5). Suppose that K(s,t) is continuous, or at least can be approximated sufficiently well by a kernel of finite rank so that one can write

(5.57)    $K(s,t) = S(s,t) + \sum_{j=1}^{n} u_j(s)\, v_j(t),$

where S(s,t) is the kernel of a linear integral operator S with $\|S\| < 1/|\lambda|$ in the appropriate norm. Then, the linear integral operator $I - \lambda K$ in (4.5) has the form

(5.58)    $I - \lambda K = I - \lambda S - \lambda \sum_{j=1}^{n} u_j \rangle\langle v_j,$

where $(I - \lambda S)^{-1} = T(\lambda)$ exists by Theorem 5.1. Applying this operator to the equation $(I - \lambda K)x = y$, the integral equation (4.5) is seen to be equivalent to the linear equation

(5.59)    $\Bigl(I - \lambda \sum_{j=1}^{n} T(\lambda)\, u_j \rangle\langle v_j\Bigr)\, x = T(\lambda)\, y,$

and thus $I - \lambda K$ has a Fredholm theory by Theorem 5.3. This approach regards $I - \lambda K$ as an algebraic perturbation of the invertible operator $I - \lambda S$.

On the other hand, suppose that

(5.60)    $\Bigl(I - \lambda \sum_{j=1}^{n} u_j \rangle\langle v_j\Bigr)^{-1} = I + \dfrac{\lambda}{\delta} \sum_{j=1}^{n} \hat u_j \rangle\langle \hat v_j = Z(\lambda)$

exists, where the notation (5.50) has been used. Then, $I - \lambda K$ is an analytic perturbation of an invertible operator, and (4.5) is equivalent to the equation

(5.61)    $(I - \lambda Z(\lambda) S)\, x = Z(\lambda)\, y.$

From Theorem 5.1, if $(I - \lambda K)^{-1}$ exists and $\|\lambda S\| < 1/\|(I - \lambda K)^{-1}\|$, then $Z(\lambda)$ exists, so all sufficiently good finite rank approximations (5.57) to K(s,t) will lead to a solvable perturbed equation

(5.62)    $\Bigl(I - \lambda \sum_{j=1}^{n} u_j \rangle\langle v_j\Bigr)\, w = y$

which is equivalent to a finite linear algebraic system of the form (5.44). Conversely, if the inverse operator $Z(\lambda)$ exists and $\|\lambda Z(\lambda) S\| < 1$, then it follows from the same theorem that $(I - \lambda K)^{-1}$ exists. If $u_1, u_2, \ldots, u_n$ and $v_1, v_2, \ldots, v_n$ are chosen so that all the inner products required can be calculated explicitly, then this gives a method for concluding the existence and uniqueness of the solution of the integral equation (4.5) on the basis of a finite set of algebraic computations, as well as a technique to obtain approximate solutions. An error analysis for (5.62) may be carried out by the analytic methods of §5a with $\Delta A = -\lambda S$, $\Delta y = 0$. A similar approach can be used on (5.59) with $T(\lambda)$ replaced by

(5.63)    $T_k(\lambda) = I + \lambda S + \lambda^2 S^2 + \cdots + \lambda^k S^k.$

Setting $z = T_k(\lambda) y$, the perturbed equation

(5.64)    $\Bigl(I - \lambda \sum_{j=1}^{n} T_k(\lambda)\, u_j \rangle\langle v_j\Bigr)\, w = z$
may be analyzed by the same technique, with

(5.65)    $\Delta A = \lambda \sum_{j=1}^{n} (\lambda S)^{k+1}\, T(\lambda)\, u_j \rangle\langle v_j,$

and

(5.66)    $\Delta T(\lambda) y = -(\lambda S)^{k+1}\, T(\lambda)\, y.$

As $\|T(\lambda)\| \le 1/(1 - \|\lambda S\|)$, and for

(5.67)    $\Delta F = \lambda \sum_{j=1}^{n} u_j \rangle\langle v_j,$

one has

(5.68)    $\|\Delta A\| \le \dfrac{\|\lambda S\|^{k+1}}{1 - \|\lambda S\|}\, \|\Delta F\|, \qquad \|\Delta T(\lambda) y\| \le \dfrac{\|\lambda S\|^{k+1}}{1 - \|\lambda S\|}\, \|y\|,$

with

(5.69)    $\|\Delta F\| \le |\lambda| \sum_{j=1}^{n} \|u_j\|\, \|v_j\|$

from (5.67).

Another analytic approach to the approximate solution of (4.5) which also yields error bounds is to solve (5.61) by iteration, as $\|\lambda Z(\lambda) S\| < 1$ [29].


6. The singular case and generalized inverses. Attention will now be devoted to linear problems which are ill-posed because the linear operator involved does not have a bounded inverse. As the solutions, if any, of ill-posed problems do not depend on the data in a continuous fashion, it might be expected in this situation that analytic perturbation methods will be of little utility, or can be applied only under very restrictive conditions. For example, there is an inherent limitation as to how well an operator $B \in L(X,Y)$ without a continuous inverse can be approximated by an operator A belonging to the class of operators in L(X,Y) with continuous inverses $A^{-1} \in L(Y,X)$. From Theorem 5.1,


(6.1)    $\|B - A\| = \|\Delta A\| \ge \dfrac{1}{\|A^{-1}\|};$

otherwise, the assumption that B has no continuous inverse would be contradicted. Also, from (6.1),

(6.2)    $\|A^{-1}\| \ge \dfrac{1}{\|B - A\|} = \dfrac{1}{\|\Delta A\|},$

so that $\|A^{-1}\|$ and the approximate condition number

(6.3)    $k(A) = \|B\|\, \|A^{-1}\| \ge \dfrac{\|B\|}{\|\Delta A\|}$

grow without limit as $\|\Delta A\| \to 0$. Clearly, computational difficulties can be expected in the calculation of $A^{-1}$ or in the solution of the linear equation (1.1) if A is very close in the analytic sense to an operator B which does not have a continuous inverse.




Theorem 6.1. If $\{A_n\}$ is any sequence of linear operators with continuous inverses such that $\lim_{n \to \infty} \|A_n - B\| = 0$, then B does not have a continuous inverse if and only if (6.2) holds for each $A = A_n$, n = 1,2,....

Proof: If B does not have a continuous inverse, then it has already been shown that (6.2) holds for each $A_n$. To show the converse, suppose that B has a continuous inverse, and choose n sufficiently large so that $\|\Delta A\| = \|A_n - B\| < 1/(2\|B^{-1}\|)$. It then follows from (5.2) that

(6.4)    $\|A_n^{-1}\| \le \dfrac{\|B^{-1}\|}{1 - \|\Delta A\| \cdot \|B^{-1}\|} < \dfrac{1}{\|\Delta A\|},$

a contradiction of (6.2) which proves the theorem. QED

An evident drawback of analytic perturbation theory is that, in general, no conclusions can be drawn from the existence of $A^{-1} \in L(Y,X)$ about the invertibility or noninvertibility of any operator B for which inequality (6.1) holds. The algebraic theory, on the other hand, states that if B is the finite rank modification (5.42) of an invertible linear operator $A \in \mathcal{J}$, then $B^{-1}$ exists if and only if

(6.5)    $\delta = \det\bigl(\delta_{ij} + \langle v_i A^{-1}, u_j\rangle\bigr) \ne 0.$

Of course, one would still expect computational difficulty if B is nearly singular, especially if the inner products $a_{ij} = \langle v_i A^{-1}, u_j\rangle$, i,j = 1,2,...,n, can only be calculated approximately.

The algebraic approach also provides information in the singular case. Supposing that $\delta = 0$, consider the transposed homogeneous equation

(6.6)    $t\,\Bigl(A + \sum_{i=1}^{n} u_i \rangle\langle v_i\Bigr) = 0$

for $t \in Y^*$. Using the technique of §5b, this is equivalent to the finite linear algebraic system

(6.7)    $\tau_j + \sum_{i=1}^{n} \tau_i\, a_{ij} = 0, \quad j = 1,2,\ldots,n,$

for $\tau_i = \langle t, u_i\rangle$. The system of equations (6.7) is the transposed homogeneous system corresponding to (5.44). If $\delta = 0$, then (6.7) has d linearly independent solutions

(6.8)    $\tau^{(k)} = (\tau_1^{(k)}, \tau_2^{(k)}, \ldots, \tau_n^{(k)}), \quad k = 1,2,\ldots,d,$

and, corresponding to these, equation (6.6) also has d linearly independent solutions

(6.9)    $t^{(k)} = \sum_{i=1}^{n} \tau_i^{(k)}\, v_i A^{-1} = \sum_{i=1}^{n} \tau_i^{(k)}\, \tilde v_i,$

k = 1,2,...,d. Likewise, the homogeneous system

(6.10)    $\xi_i + \sum_{j=1}^{n} a_{ij}\, \xi_j = 0, \quad i = 1,2,\ldots,n,$

has d linearly independent solutions

(6.11)    $\xi^{(k)} = (\xi_1^{(k)}, \xi_2^{(k)}, \ldots, \xi_n^{(k)})^T, \quad k = 1,2,\ldots,d,$

from which are obtained the corresponding linearly independent solutions

(6.12)    $w^{(k)} = \sum_{j=1}^{n} \xi_j^{(k)} A^{-1} u_j = \sum_{j=1}^{n} \xi_j^{(k)}\, \tilde u_j,$

k = 1,2,...,d, of the homogeneous equation

(6.13)    $\Bigl(A + \sum_{j=1}^{n} u_j \rangle\langle v_j\Bigr)\, w = 0.$

Representing the right-hand sides of the system (5.44) as the vector

(6.14)    $\zeta = (\zeta_1, \zeta_2, \ldots, \zeta_n)^T = (\langle v_1 A^{-1}, z\rangle, \langle v_2 A^{-1}, z\rangle, \ldots, \langle v_n A^{-1}, z\rangle)^T,$

it is seen immediately that the conditions for the solvability of the finite inhomogeneous system (5.44) and the equivalent inhomogeneous equation (2.3) for the case $\delta = 0$ are

(6.15)    $\langle \tau^{(k)}, \zeta\rangle = \Bigl\langle \sum_{i=1}^{n} \tau_i^{(k)}\, \tilde v_i,\, z \Bigr\rangle = \langle t^{(k)}, z\rangle = 0,$

k = 1,2,...,d; that is, z must be orthogonal to all solutions of the homogeneous equation (6.6). If (6.15) is satisfied, then the general solution of (2.3) may be written as

(6.16)    $w = \bar w + \sum_{k=1}^{d} \alpha_k\, w^{(k)},$

where $\bar w$ is some particular solution of (2.3), and the complementary vectors

(6.17)    $\omega = \omega(\alpha_1, \alpha_2, \ldots, \alpha_d) = \sum_{k=1}^{d} \alpha_k\, w^{(k)}$

satisfy the homogeneous equation (6.13) for arbitrary $\alpha_1, \alpha_2, \ldots, \alpha_d$.

Usually, in actual computational solution of linear equations, the distinction between the singular and nonsingular cases is not as clear-cut as in the alternatives (1.1a) or the Fredholm theory. In practice, an objective or subjective standard is set for what constitutes an "acceptable" (approximate) solution, and one of the following situations is observed:

(6.18)    (i) An acceptable solution is obtained,
          or
          (ii) either no solution at all is found, or the computed solution is unacceptable.

In the computationally singular case (6.18ii), the method used to solve (1.1) or invert A may break down because A does not have a bounded inverse, or is analytically close to an operator B which does not. On the other hand, the algorithm employed may actually be trying to solve the system (5.44) with $\delta = 0$ and without (6.15) holding to the desired degree of accuracy. This will be called an algebraic catastrophe of type I. In the second situation described in (6.18ii), the acceptable particular solution $\bar w$ may be contaminated by a complementary vector (6.17) to the extent that the resulting solution is unacceptable. This algebraic catastrophe of type II can occur in the numerical solution of differential equations by the use of approximating difference equations. For example, the difference equation

(6.19)    $3u_{n+1} + 8u_n - 3u_{n-1} = 0$

with the initial conditions

(6.20)    $u_0 = 1, \quad u_1 = \tfrac{1}{3}$

has the bounded solutions

(6.21)    $u_n = \bigl(\tfrac{1}{3}\bigr)^n, \quad n = 0,1,2,\ldots,$

which may be the ones considered to be acceptable. However, a slight perturbation of (6.20), such as rounding to eight decimal places,

(6.22)    $w_0 = 1, \quad w_1 = 0.33333333,$

gives the corresponding solutions $w_n$ of $3w_{n+1} + 8w_n - 3w_{n-1} = 0$ as

(6.23)    $w_n = (0.999999999)\bigl(\tfrac{1}{3}\bigr)^n + (0.000000001)(-3)^n,$

n = 1,2,..., and the second term on the right-hand side of (6.23) will eventually wreak havoc with the accuracy of the approximation of $u_n$ by $w_n$.
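The growth of the parasitic term in (6.23) is easy to observe by running the recurrence forward. The following sketch uses exact rational arithmetic so that the only error is the rounding of the initial value in (6.22); the function name and the choice of 25 steps are illustrative assumptions.

```python
from fractions import Fraction as F

def iterate(w0, w1, steps):
    """Forward recurrence 3 w[n+1] + 8 w[n] - 3 w[n-1] = 0, returning w[0..steps]."""
    w = [w0, w1]
    for _ in range(steps - 1):
        w.append((3 * w[-2] - 8 * w[-1]) / 3)
    return w

exact = iterate(F(1), F(1, 3), 25)                       # initial data (6.20)
perturbed = iterate(F(1), F(33333333, 100000000), 25)    # initial data (6.22)

for n in (0, 5, 10, 15, 20, 25):
    print(n, float(exact[n]), float(perturbed[n]))
```

With the exact initial data the iterates coincide with $(1/3)^n$, while with the rounded value the component along $(-3)^n$, initially of size $10^{-9}$, dominates after roughly twenty steps.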

As  indicated  in  §lb,  if  the  operator  A is  singular,  then  a generalized  inverse 
A^  of  A having  certain  useful  properties  may  be  sought,  for  example,  to  give  a 
solution  of  (1.1)  in  the  form  (1.4)  if  (1.1)  is  consistent.  As  (1.5)  indicates,  the 
vector  x = A^y  will  be  a particular  solution  of  (1.1)  for  any  inner  inverse  A^  of 
A.  An  algebraic  perturbation  method  may  be  used  to  obtain  inner  inverses  of  singular 

operators which have a Fredholm theory, under the technical assumption that the space Y is reflexive, that is, $Y = Y^{**}$ [38, p. 192]. In this case, if

(6.24)    $U^* = \{u_1^*, u_2^*, \ldots, u_d^*\} \subset Y^*$

is a set of linearly independent functionals on Y, then the Hahn-Banach theorem guarantees the existence of a set of d linearly independent vectors in Y to which the Gram-Schmidt orthonormalization process [38, p. 116] may be applied, if necessary, to obtain the set

(6.25)    $U = \{u_1, u_2, \ldots, u_d\} \subset Y$

for which

(6.26)    $\langle u_i^*, u_j\rangle = \langle u_i, u_j^*\rangle = \delta_{ij}, \quad i,j = 1,2,\ldots,d,$

where $\delta_{ij}$ again denotes the Kronecker delta. Similarly, given a set of linearly independent vectors

(6.27)    $V = \{v_1, v_2, \ldots, v_d\} \subset X,$

a set of functionals

(6.28)    $V^* = \{v_1^*, v_2^*, \ldots, v_d^*\} \subset X^*$

exists such that

(6.29)    $\langle v_i^*, v_j\rangle = \delta_{ij}, \quad i,j = 1,2,\ldots,d,$

whether X is reflexive or not.


Theorem 6.2. Suppose that $A \in L(X,Y)$ has a Fredholm theory, and

(6.30)    $u^* A = A v = 0$

if and only if $u^* \in \operatorname{span}\{u_1^*, u_2^*, \ldots, u_d^*\} \subset Y^*$ and $v \in \operatorname{span}\{v_1, v_2, \ldots, v_d\} \subset X$, where the defect d of A is positive. Then, for $u_k \in U$ and $v_k^* \in V^*$, k = 1,2,...,d, where U and $V^*$ are defined by (6.24)-(6.29), the operator

(6.31)    $B = A - \sum_{k=1}^{d} u_k \rangle\langle v_k^*$

is invertible, and

(6.32)    $A B^{-1} A = A,$

so that $B^{-1}$ is an inner inverse of A.

Proof: To show that B is invertible, consider the homogeneous equation Bz = 0, which is equivalent to

(6.33)    $Az = \sum_{k=1}^{d} u_k\, \langle v_k^*, z\rangle.$

As this equation is solvable if and only if the right-hand side is orthogonal to $u_1^*, u_2^*, \ldots, u_d^*$ because A has a Fredholm theory, it follows from (6.26) that

(6.34)    $\langle v_k^*, z\rangle = 0, \quad k = 1,2,\ldots,d,$

and thus Az = 0. This means that z is of the form $z = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_d v_d$, where the coefficients $\alpha_k = \langle v_k^*, z\rangle$ are given by (6.34), and hence z = 0 is the unique solution of the homogeneous equation Bz = 0, which implies the existence of $B^{-1}$.

To prove (6.32), note that from (6.29), (6.30), and (6.31),

(6.35)    $B v_i = -\sum_{k=1}^{d} u_k\, \langle v_k^*, v_i\rangle = -u_i, \quad i = 1,2,\ldots,d,$

hence

(6.36)    $B^{-1} u_k = -v_k, \quad k = 1,2,\ldots,d,$

and

(6.37)    $B^{-1} A = B^{-1}\Bigl(B + \sum_{k=1}^{d} u_k \rangle\langle v_k^*\Bigr) = I - \sum_{k=1}^{d} v_k \rangle\langle v_k^*,$

and (6.32) follows directly from (6.30). QED
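The construction of Theorem 6.2 can be tried on a small singular matrix. The sketch below assumes a real 2-by-2 matrix of defect one with hand-picked null vectors; the data and helper names are hypothetical, and the dual vectors are chosen only to satisfy the normalizations (6.26) and (6.29).

```python
from fractions import Fraction as F

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def inverse(M):
    """Gauss-Jordan inverse in exact arithmetic (small dense matrices only)."""
    n = len(M)
    aug = [list(M[i]) + [F(int(i == j)) for j in range(n)] for i in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [x / piv for x in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

A = [[F(1), F(2)], [F(2), F(4)]]   # singular, defect d = 1
v = [F(2), F(-1)]                  # A v = 0
u_star = [F(2), F(-1)]             # u* A = 0
u = [F(1), F(1)]                   # chosen so that <u*, u> = 1
v_star = [F(1), F(1)]              # chosen so that <v*, v> = 1

# B = A - u >< v*, as in (6.31)
B = [[A[i][j] - u[i] * v_star[j] for j in range(2)] for i in range(2)]
B_inv = inverse(B)
ABA = matmul(A, matmul(B_inv, A))
print(ABA == A)                    # inner inverse property (6.32)
```

Here B = [[0, 1], [1, 3]] is nonsingular although A is not, and applying $B^{-1}$ to u returns $-v$, in agreement with (6.36).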


Instead of (6.36), one could also use the relationships

(6.38)    $v_k^* B^{-1} = -u_k^*, \quad k = 1,2,\ldots,d,$

to establish (6.32). The operator $B^{-1}$ obtained from (6.31) is called the Hurwitz pseudoinverse of A [31, pp. 165-168; 12], which goes back to 1912.

By the same reasoning as above, any operator of the form

(6.39)    $\Bigl(A - \sum_{k=1}^{d} u_k \rangle\, \beta_k\, \langle v_k^*\Bigr)^{-1}$

for $\beta_1, \beta_2, \ldots, \beta_d$ such that $\beta_1 \beta_2 \cdots \beta_d \ne 0$ will be an inner inverse of A. However, as these operators are invertible, they cannot satisfy condition (2) of §1b which characterizes outer inverses; consequently, the construction (6.39), while useful for some purposes, only gives a partial solution to the problem of finding generalized inverses.

Another matter of computational importance relates to the calculation of generalized inverses of perturbations of operators with known generalized inverses. Suppose, for example, that one has an efficient technique to obtain the Moore-Penrose generalized inverse $A^{\dagger}$ of A [27], and then would like to use the result to obtain the generalized inverses $B^{\dagger}$ of perturbed operators $B = A + \Delta A$ with less effort than calculating them ab initio, or error bounds for the approximation of $B^{\dagger}$ by $A^{\dagger}$.

As $A^{\dagger}$ is not a continuous function of A in general, it would be expected that analytic perturbation methods apply only under restrictive conditions, as even for $\|\Delta A\|$ arbitrarily small, one of the algebraic catastrophes that the rank of B is greater or less than the rank of A could occur. Most applications of analytic perturbation theory to the above problems are carried out under assumptions that ensure rank(B) = rank(A), or that the change in rank is known [23, pp. 333-351]. Algebraic perturbation methods, on the other hand, are not necessarily subject to this kind of limitation. For rank one modifications of A, C. D. Meyer, Jr. [19; 23, pp. 351-352] has obtained formulas of the type

(6.40)    $(A + u \rangle\langle v)^{\dagger} = A^{\dagger} + G$

for all six possible cases, where G depends on $A^{\dagger}$ and the data. More general finite-rank modifications (5.42) of A can then be handled by the method of successive rank one modifications corresponding to (5.51)-(5.52). This latter algorithm was originated by Greville [9] for the recursive calculation of the Moore-Penrose generalized inverse of a matrix. Formula (6.40) reduces to (5.28) in the special case that A is invertible, as every generalized inverse of A coincides with $A^{-1}$ for all $A \in \mathcal{J}$. This suggests the computational strategy of using a method for generalized inversion on an operator which is suspected of being singular or nearly singular. If the operator or the perturbed operator actually involved in the calculation is nonsingular, then this technique will yield its inverse, whereas a straightforward inversion method might fail.

Another approach to ill-posed problems is to approximate them by a perturbed problem which is well conditioned. An example is the technique of regularization, due to A. N. Tihonov [39, 40], which has close connections with the subject of generalized inverses [22]. If the operator A in (1.1) does not have a bounded inverse, then the smallest perturbation $\Delta y$ in the data can cause an enormous change $\Delta x$ in the solution of the perturbed problem (2.3) as compared to the solution of the reference problem. A typical situation in which problems of this type arise in applications is that X and Y are Hilbert spaces, and A = K is a compact operator. The prototype of the resulting equation

(6.41)    $Kx = y$

is the linear Fredholm integral equation of the first kind,

(6.42)    $\int_0^1 K(s,t)\, x(t)\, dt = y(s), \quad 0 \le s \le 1.$

As perturbations in (6.42) in actual practice are inevitable, due to errors of measurement, discretization, and computation, direct numerical solution of (6.42) by standard techniques that work well for the integral equation (4.5) of second kind is rarely successful. The same observation may be made for (6.41) as compared to

(6.43)    $(\alpha I - K)\, x = y$

for $\alpha \ne 0$. In order to find an acceptable approximate solution of the perturbed version of (6.41), the method of regularization consists of finding an element $w(\alpha) \in X$ which minimizes the functional

(6.44)    $f(w;\alpha) = \|Kw - z\|^2 + \alpha^2 \|w\|^2.$

Thus, (6.44) represents a trade-off between the fidelity with which the perturbed equation Kw = z is satisfied, and the size of the norm of the corresponding solution. The parameter $\alpha$ (or sometimes $\alpha^2$) in (6.44) is called the regularization parameter. The crucial problem in this field is the determination of the optimal regularization parameter, for which the value of $f(w;\alpha)$ is minimum, or at least a method for obtaining good approximations to the optimal value. A significant recent advance in this area is the application by Grace Wahba [41] of the method of weighted cross-validation to the case that the perturbation is due to discretization of the data with random errors of the type known as "white noise".
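For a finite-dimensional real problem with Euclidean norms, the minimizer of the functional (6.44) satisfies the normal equations $(K^T K + \alpha^2 I)w = K^T z$, a standard fact that is not stated in the text. The sketch below applies it to a hypothetical, nearly singular 2-by-2 matrix; the data, the value of $\alpha$, and the function names are illustrative assumptions.

```python
from fractions import Fraction as F

def solve(M, b):
    """Gauss-Jordan elimination in exact arithmetic (small dense systems only)."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(M, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [x / piv for x in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [x - f * yv for x, yv in zip(aug[r], aug[c])]
    return [aug[i][n] for i in range(n)]

def tikhonov(K, z, alpha):
    """Minimizer of ||K w - z||^2 + alpha^2 ||w||^2 via the normal equations."""
    m, n = len(K), len(K[0])
    KtK = [[sum(K[r][i] * K[r][j] for r in range(m)) + (alpha * alpha if i == j else 0)
            for j in range(n)] for i in range(n)]
    Ktz = [sum(K[r][i] * z[r] for r in range(m)) for i in range(n)]
    return solve(KtK, Ktz)

def functional(K, w, z, alpha):
    """Value of the regularization functional at w."""
    res = [sum(kr[j] * w[j] for j in range(len(w))) - zr for kr, zr in zip(K, z)]
    return sum(x * x for x in res) + alpha * alpha * sum(x * x for x in w)

# Hypothetical ill-conditioned problem: exact data (2, 2.0001) give x = (1, 1).
K = [[F(1), F(1)], [F(1), F(10001, 10000)]]
z = [F(2), F(20002, 10000)]        # second datum perturbed by 1e-4
alpha = F(1, 100)
w_plain = solve(K, z)              # direct solution of the perturbed system
w_reg = tikhonov(K, z, alpha)      # regularized solution
print([float(x) for x in w_plain], [float(x) for x in w_reg])
```

In this example the direct solution of the perturbed system jumps to (0, 2), whereas the regularized solution stays close to (1, 1); the value of $\alpha$ was picked by hand and is not an optimal choice in the sense discussed above.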

7. The eigenvalue-eigenvector problem. As stated in §1c, this problem is to find eigenvalues $\lambda$ and right eigenvectors $x \ne 0$ satisfying (1.9), where $A \in L(X,X)$, X a Hilbert space. It follows that one is interested in the values of $\lambda$ for which the linear operator

(7.1)    $T(\lambda) = A - \lambda I$

is singular, and one may also want to find the left eigenvectors $y \ne 0$ of A corresponding to the eigenvalue $\lambda$ which satisfy the homogeneous equation

(7.2)    $y\,(A - \lambda I) = 0.$

The additional assumption will be made that the values of $\lambda$ considered are restricted to those for which $T(\lambda)$ has a Fredholm theory. This condition does not exclude any $\lambda$ in the finite-dimensional algebraic case; however, for Fredholm integral operators of the first kind or compact operators in general, it is customary to formulate the eigenvalue-eigenvector problem in terms of the reciprocal eigenvalues $\mu = 1/\lambda$, as the operator

(7.3)    $S(\mu) = I - \mu K, \quad K \text{ compact},$

will have a Fredholm theory for all scalars $\mu \in \Lambda$ by Theorem 5.3. This is equivalent to excluding $\lambda = 0$ from consideration in (7.1) if A is compact.

In order to contemplate the application of analytic perturbation methods to the eigenvalue-eigenvector problem, it is essential to determine conditions under which this problem is well-posed, as the operator $T(\lambda)$ will be singular if $\lambda$ is an eigenvalue. One way to do this is to convert equation (1.9) and the normalization condition (1.10) into the nonlinear system


P(q)  := 


Ax  - Ax  \ 

1=  0 

1 


in the product space $Q = X \times \Lambda$ of vectors $q = (x,\lambda)^T$, $x \in X$, $\lambda \in \Lambda$. Suppose that $q_1 = (x_1,\lambda_1)^T$ is a solution of (7.4); that is, $\lambda_1$ is an eigenvalue of A, and $x_1$ is a corresponding normalized eigenvector. Then, the implicit function theorem [10] guarantees continuous dependence of the solution of (7.4) on the data if the linear operator $P'(q_1) \in L(Q,Q)$ has a bounded inverse, where $P'(q)$ is the Fréchet derivative


P’  (q)  = 


A - Al 


of the operator P at q [30, pp. 97-100]. The formulation (7.4), while not the most general [1], has the advantage that if A is Hermitian ($A^* = A$ [38, pp. 324-327]), then so is $P'(q)$. The following theorem gives an explicit formulation of the inverse operator $[P'(q_1)]^{-1}$ in this case if the defect of $T(\lambda_1)$ is equal to one, that is, if all solutions x of the homogeneous equation $T(\lambda_1)x = 0$ are scalar multiples of the normalized eigenvector $x_1$, making use of the fact that the right and left eigenvectors of an Hermitian operator can be identified.

Theorem 7.1. If A is Hermitian, $q_1 = (x_1,\lambda_1)^T$ satisfies (7.4), and the defect of $T(\lambda_1)$ is equal to one, then

_x  ( ^ ^ X> 
(7.6)  [P’(q .)]  = 1 

V 


$= (A - \lambda_1 I - x_1 \rangle\langle x_1)^{-1}$

is the Hurwitz pseudoinverse of $A - \lambda_1 I$.
Proof: It follows by direct calculation and the use of (6.36) and (6.38) that the products of $P'(q_1)$ with the operator on the right-hand side of (7.6), taken in either order, are equal to the identity operator in $Q = X \times \Lambda$. QED


By the use of Theorem 6.2, formula (7.6) can be extended immediately to the non-Hermitian case $y_1 T(\lambda_1) = T(\lambda_1) x_1 = 0$, provided the defect of $T(\lambda_1)$ remains equal to one [1, §3]. Under these circumstances, results are available by the methods of analytic perturbation theory similar to those for nonsingular linear equations (1.1) [1, §5].

For the finite-dimensional case, perturbation methods and error analysis for the algebraic eigenvalue problem have been presented in great detail in the comprehensive work by J. H. Wilkinson [44, pp. 62-188]. Just one of these results will be cited here, which fits into the framework of algebraic perturbation theory. Suppose that w is a unit vector, and $p = (w,\mu)^T$ is an approximate solution of (7.4), so that

(7.9)    $(A - \mu I)\, w = r,$

with residual vector r. From equation (5.38), it follows that

(7.10)    $(A - r \rangle\langle w^* - \mu I)\, w = 0,$

so that w is an exact eigenvector of the perturbed operator

(7.11)    $B = A - r \rangle\langle w^*$

corresponding to the eigenvalue $\mu$ [44, pp. 170-171]. The perturbed operator B is simply a rank one modification of the reference operator A.
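The statement (7.9)-(7.11) can be verified directly for a small symmetric matrix. In the sketch below the space is assumed to be real and Euclidean, so that $w^*$ is simply $w^T$ for a unit vector w; the matrix, the trial vector, and the use of the Rayleigh quotient for $\mu$ are illustrative choices, not taken from the text.

```python
from fractions import Fraction as F

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def eigen_backward_perturbation(A, w, mu):
    """Return (r, B) with r = (A - mu I) w and B = A - r w^T, for a unit vector w."""
    r = [aw - mu * wi for aw, wi in zip(matvec(A, w), w)]
    n = len(w)
    B = [[A[i][j] - r[i] * w[j] for j in range(n)] for i in range(n)]
    return r, B

# Hypothetical data: A has eigenvalues 1 and 3; w is a unit vector with rational entries.
A = [[F(2), F(1)], [F(1), F(2)]]
w = [F(3, 5), F(4, 5)]
mu = sum(wi * awi for wi, awi in zip(w, matvec(A, w)))   # Rayleigh quotient as trial eigenvalue
r, B = eigen_backward_perturbation(A, w, mu)
print(matvec(B, w) == [mu * wi for wi in w])             # w is an exact eigenvector of B
```

The Euclidean norm of the residual, here 7/25, equals the 2-norm of the rank one perturbation $r \rangle\langle w^*$, so a small residual certifies that $(w,\mu)$ is an exact eigenpair of a nearby operator.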

Another application of algebraic perturbation theory to the eigenvalue-eigenvector problem has been given by W. Stenger [37] to find inequalities between eigenvalues of perturbed and reference integral operators.

8. Linear programming. The solution of linear programming problems as formulated in §1d is one of the primary tools for decision making in government and commerce at the present time [8]. The number of variables involved is typically large, and a lot of computer time is expended for this purpose. Thus, an application of perturbation theory which would increase efficiency could result in substantial savings. Once again, the fact that the solutions do not depend continuously on the data in general limits the applicability of analytic perturbation techniques. A necessary and sufficient condition for continuous dependence of the solution of the primal and dual linear programming problems in a neighborhood of solvable reference problems has been given recently by S. M. Robinson [34]. Studies of what is called parametric programming give conditions under which the solution of the reference problem remains unchanged under perturbation of the data [8, pp. 144-154]. On the subject of error estimation, P. Wolfe [45] has contributed a method for error analysis and control in the solution of linear programming problems.

Although changes in the objective function (1.14) are not usually difficult to deal with, perturbations in the constraints (1.15), as would result, for example, from the introduction of a new technology in an industry, may require the complete restarting of the solution method used. Consequently, the following problem may be of practical interest.

Problem 8.1. Given the solution x of (1.14)-(1.15) and the associated information, such as the choice of pivots in the simplex algorithm [45], find an efficient method for solving

(8.1)    minimize $f(w) := \langle d, w\rangle + \eta$

subject to

(8.2)    $Bw \le z, \quad w \ge 0,$

where all perturbations in the reference data are of finite rank which is small compared to the size of the reference problem.

9.  Nonlinear  problems.  Although  this  survey  has  been  concerned  mainly  with  linear 
problems,  it  should  be  mentioned  that  perturbation  methods  are  widely  applied  to  the 
solution  of  nonlinear  operator  equations 

(9.1)  P(x)  = 0 , 

where  P maps  X into  Y , and  also  fixed  point  problems  in  X of  the  form 

(9.2)  x = H (x)  . 

(It is evident that (9.2) is a special case of (9.1); conversely, there are many ways
to convert (9.1) into an equivalent fixed point problem.)

These problems are well-posed in the neighborhood of a solution $x_0$ if, for example,
H is continuous and contractive [30, Chapter 2], or, more restrictively, if
P is differentiable and

(9.3)  $\Gamma_0 = [P'(x_0)]^{-1} \in L(Y,X)$.

Depending  on  the  smoothness  of  P , in  this  case  one  can  base  analytic  perturbation 
techniques  on  the  implicit  function  theorem  [10],  Newton's  method  and  its  variants, 
Taylor  series  expansions,  inversion  of  power  series,  and  so  on  [30,  Chapter  4]. 

These  methods  are  all  essentially  derived  from  the  corresponding  ideas  of  elementary 
scalar  calculus. 
Recently, W. Rheinboldt has given generalizations of the condition numbers (5.8)
and (5.9) for nonlinear operators for which (9.3) holds, and a corresponding generalization
of the perturbation formula (5.13) for error estimation [33].

Algebraic perturbation methods for nonlinear operator equations are less well
investigated. A nonlinear operator F with range belonging to the finite-dimensional
space

(9.4)  $Y_n = \mathrm{span}\{y_1, y_2, \ldots, y_n\}$

will be of the form

(9.5)  $F(\cdot) = \sum_{j=1}^{n} y_j f_j(\cdot)$,

where $f_1(\cdot), f_2(\cdot), \ldots, f_n(\cdot)$ are (generally nonlinear) functionals on X. The perturbed
operator equation

(9.6)  $Q(x) = 0$,

where Q = P - F, is equivalent to the equation

(9.7)  $P(x) = \sum_{j=1}^{n} \xi_j y_j$,

where

(9.8)  $\xi_j = f_j(x), \quad j = 1, 2, \ldots, n$.

Suppose, and this is the big assumption, that the equation P(x) = y is solvable for
$y \in Y_n$, that is, an operator G is known which gives

(9.9)  $x = G(\xi_1, \xi_2, \ldots, \xi_n)$

if P(x) = y is of the form (9.7). Then, applying $f_1, f_2, \ldots, f_n$ in turn to (9.9)
yields the nonlinear system

(9.10)  $\xi_i = h_i(\xi_1, \xi_2, \ldots, \xi_n), \quad i = 1, 2, \ldots, n$,

where $h_1 = f_1 G, h_2 = f_2 G, \ldots, h_n = f_n G$, which is a finite-dimensional fixed-point
problem in $R^n$ of the form (9.2). On the basis of the additional assumption that
(9.10) is solvable, the substitution of its solutions $\xi_1, \xi_2, \ldots, \xi_n$ into (9.9) provides
a solution x of the nonlinear operator equation (9.6). As an example of this
approach, the Hammerstein integral equation with kernel (5.31)

(9.11)  $x(s) - \int_0^1 K(s,t)\,\phi(t, x(t))\,dt = 0$

is a rank one modification of the nonlinear Volterra integral equation

(9.12)  $x(s) - \int_0^s L(s,t)\,\phi(t, x(t))\,dt = 0$

with kernel (5.33). Thus, if one can solve

(9.13)  $x(s) - \int_0^s L(s,t)\,\phi(t, x(t))\,dt = \xi u(s)$,

where

(9.14)  $\xi = \int_0^1 v(t)\,\phi(t, x(t))\,dt$,

for $x(s) = g(s;\xi)$, then from (9.14), the system (9.10) is equivalent to the scalar
fixed point problem

(9.15)  $\xi = h(\xi) := \int_0^1 v(t)\,\phi(t, g(t;\xi))\,dt$,

which is one nonlinear equation in one unknown [32, §5].

Although quite a bit is known about nonlinear systems (9.10) in finite-dimensional
spaces [28], the theory and practice of their solution is far from the highly
developed technology available for finite linear systems (5.44). There is also the
ever-present big assumption. Even though (9.9) is not obtainable explicitly, the
form of the problem (9.7) suggests iteration: Solve (9.7) for given
$\xi_1^{(0)}, \xi_2^{(0)}, \ldots, \xi_n^{(0)}$, substitute into (9.10) to obtain

(9.16)  $\xi_i^{(1)} = h_i(\xi_1^{(0)}, \xi_2^{(0)}, \ldots, \xi_n^{(0)}), \quad i = 1, 2, \ldots, n$,

and so on. In the case that (9.6) is a boundary-value problem for a nonlinear differential
equation, this is called "shooting" [13, Chapter 2, also §6.1]. Of course, this
iteration may not converge, and some other method for solving (9.6) may be more appropriate.
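As a minimal sketch of the reduction to a scalar fixed point problem and of the iteration just described, the following Python fragment treats the simplest rank one case. Everything specific in it is an assumption made for illustration and is not taken from the survey: the Volterra part is taken to be L = 0, so that the solution of the auxiliary equation is simply g(s; xi) = xi*u(s), and the functions u, v and phi are hypothetical choices for which h is contractive. The integral is approximated by the composite trapezoidal rule.

```python
import math

def trapezoid(f, a, b, m=2000):
    """Composite trapezoidal rule with m panels."""
    step = (b - a) / m
    total = 0.5 * (f(a) + f(b))
    for k in range(1, m):
        total += f(a + k * step)
    return step * total

# Hypothetical data (assumptions, not from the text): kernel K(s,t) = u(s) v(t),
# i.e. the Volterra kernel L is identically zero.
u = lambda s: s
v = lambda t: 1.0
phi = lambda t, x: 0.5 * math.cos(x) + t

def g(s, xi):
    # With L = 0 the auxiliary equation reduces to x(s) = xi * u(s).
    return xi * u(s)

def h(xi):
    # Scalar map: integral over [0, 1] of v(t) * phi(t, g(t; xi)).
    return trapezoid(lambda t: v(t) * phi(t, g(t, xi)), 0.0, 1.0)

def fixed_point(func, xi0=0.0, tol=1e-12, max_iter=200):
    """Successive substitution xi <- func(xi); may fail if func is not contractive."""
    xi = xi0
    for _ in range(max_iter):
        xi_new = func(xi)
        if abs(xi_new - xi) < tol:
            return xi_new
        xi = xi_new
    raise RuntimeError("iteration did not converge")

xi_star = fixed_point(h)
solution = lambda s: g(s, xi_star)   # approximate solution x(s) of the integral equation
```

For this particular choice the map can be integrated by hand, h(xi) = 0.5*sin(xi)/xi + 0.5, which gives an independent check on the computed fixed point. For a map that is not contractive the loop raises an error, in line with the warning that the iteration may not converge.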

This  section  will  also  conclude  with  an  important  problem,  as  much  more  work 
needs  to  be  done. 

Problem 9.1. For differentiable P, develop existence theory and find effective
techniques for computing solutions x of the nonlinear operator equation (9.1)
in the case that $P'(x_0)$ does not have a bounded inverse.
SOLUTION OF PARTIAL DIFFERENTIAL EQUATIONS
ON  VECTOR  COMPUTERS 

James  M.  Ortega 
Robert  G.  Voigt 


ABSTRACT 

In  this  paper  we  review  the  present  status  of  numerical  methods 
for  partial  differential  equations  on  vector  computers,  that  is, 
computers  with  hardware  instructions  for  vector  operands.  Both 
direct and iterative methods for elliptic equations are discussed, as well
as explicit and implicit methods for initial boundary value
problems.  The  intent  is  to  point  out  attractive  methods,  as  well 
as  areas  where  this  class  of  computer  cannot  be  fully  utilized 
because  of  either  hardware  restrictions  or  the  lack  of  adequate 
algorithms. 


This  paper  is  the  text  of  an  invited  survey  talk  presented  at 
the  1977  Army  Numerical  Analysis  and  Computers  Conference.  The 
preparation  of  this  paper  was  supported  under  NASA  Contract 
NAS  1-14101  while  the  authors  were  in  residence  at  ICASE,  NASA 
Langley  Research  Center,  Hampton,  VA.  23665 
1.  Introduction


For the past 15 years, there has been interest in the use of computers
with a parallel or pipeline architecture for the solution of very large
scientific computing problems. As a result of the impending implementation
of such computers, there was considerable activity in the mid and late 1960's
in the development of parallel numerical methods. Some of this work is
summarized in the classical review article of Miranker [1971]. It has only
been in the period since then, however, that such machines have become available.
The ILLIAC IV was put into limited operation at NASA's Ames Research
Center in 1972; the first Texas Instruments Advanced Scientific Computer
(TI-ASC) became operational in Europe in 1972, primarily for seismic calculations;
the first Control Data Corporation STAR-100 was delivered to
Lawrence Livermore Laboratory in 1974; and the first Cray Research Corporation
CRAY-1 was put into service at Los Alamos Scientific Laboratory in
1976. (A summary of the basic characteristics of these four machines will
be given in Section 2.) There have also been a number of other parallel
configurations designed primarily for other types of applications, such
as Goodyear Corporation's STARAN (Goodyear [1974], Gilmore [1971], Rudolph
[1972]) and the PEPE system (Berg et al. [1972]), or for research purposes,
such as the C.mmp system at Carnegie-Mellon University (Wulf and Bell [1972]).

The ILLIAC IV, TI-ASC, CDC STAR-100, and CRAY-1, although differing
sometimes considerably in their architectures, are all examples of vector
computers; that is, they have hardware instructions which accept vectors as
operands. With the exception of the ILLIAC IV, they also have the usual scalar
operations. Under optimum conditions, these machines are capable of producing
floating point results at rates of 50 to 200 million per second. Here,
"optimum conditions" vary somewhat among the machines but, roughly, the term means
operating on vectors of fairly long lengths (10000 or longer) on the ASC or
STAR-100, or of lengths which are multiples of 64 on the ILLIAC IV or CRAY-1.
This will be elaborated on in the sequel.

As  an  example  of  the  gain  that  can  be  achieved,  experiments  at  Langley 
Research  Center  showed  a speed-up  of  approximately  30  to  1 for  the  STAR-100 
over  a CDC  6600  on  the  solution  of  300  x 300  linear  systems  by  Gaussian 
elimination. Both codes were written in assembly language and made as optimal
as possible.

In  practice,  it  is  difficult  to  achieve  optimum  conditions  and  the 
challenge  for  the  numerical  analyst  is  to  devise  algorithms  which  utilize 
as  much  as  possible  the  fast  vector  processing  capabilities.  Stone  [1973b] 
has  given  the  following  general  dictums: 

1.  Data  must  be  arranged  (and  possibly  rearranged)  for  efficient 
computation. 

2.  Efficient  serial  algorithms  are  not  necessarily  good  for  parallel 
machines  and,  conversely,  inefficient  serial  algorithms  may  lead  to  efficient 
parallel  algorithms. 

3.  Some  algorithms  may  appear  to  be  inherently  serial  but  may  be 
transformed  to  efficient  parallel  algorithms. 
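A standard illustration of the third point, offered here as a sketch and not taken from the text, is the running sum of a sequence. Written as a loop it looks inherently serial, since each result needs the previous one, yet the recursive doubling rearrangement below obtains all sums in about log2(n) steps, each of which is a single elementwise (vector) addition of a sequence with a shifted copy of itself. The function names are hypothetical.

```python
def prefix_sums_serial(a):
    """Running sums computed one element at a time."""
    out, running = [], 0
    for value in a:
        running += value
        out.append(running)
    return out

def prefix_sums_doubling(a):
    """Running sums by recursive doubling.

    Each pass is one elementwise addition of x with a copy of itself
    shifted by `shift` positions; the shift doubles on every pass.
    Returns the sums and the number of passes used.
    """
    x = list(a)
    n = len(x)
    shift = 1
    passes = 0
    while shift < n:
        # All elements are updated from the old values "at once".
        x = [x[i] + x[i - shift] if i >= shift else x[i] for i in range(n)]
        shift *= 2
        passes += 1
    return x, passes

data = list(range(1, 11))
sums, passes = prefix_sums_doubling(data)
```

The doubling version performs more additions in total than the serial loop, which also illustrates the second point: an inefficient serial algorithm can be the better choice on a vector machine.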

There are as yet very few papers in print which describe in detail the
application of vector computers to realistic problems, and some of these
appeared before the machines themselves were available. For example, Carroll
and Weatherald [1967] discuss the possible application of the Solomon computer
- the predecessor, which was never built, of the ILLIAC IV - to hydrodynamics
problems and general circulation weather models in particular; Reilly [1970]
considers a Monte Carlo method for the Boltzmann equation on the ILLIAC IV;
and Ogura, Sher, and Ericksen [1972] review the theoretical efficiency of
the ILLIAC IV for hydrodynamics calculations.


More recently, Wilhelmson [1974], and Erickson and Wilhelmson [1976] have
considered convection problems - and in particular the Benard-Rayleigh problem
- on the ILLIAC IV; they use Dufort-Frankel differencing on the diffusion
terms, a scheme of Lilly for the convection terms, a fast Fourier method for
the Poisson equation, and leap-frog differencing in time. One of the main
thrusts of their work is a proper balancing of computation with disk to core
transfers.

Davy  and  Reinhart  [1975]  discuss  the  application  of  the  ILLIAC  IV  to 
a chemically  reacting,  inviscid  hypersonic  flow  problem,  using  MacCormack's 
method with shock capturing. McCulley and Zaher [1974] report on the solution
of diffusion type equations on the ILLIAC IV in a problem that arises
in  planetary  entry.  Boris  [1976]  applies  his  flux-corrected  transport  (FCT) 
algorithm  to  continuity  type  equations  on  the  TI-ASC;  he  concludes  that 
the  FCT  method  is  "fully  vectorizable." 

Lambiotte and Howser [1974] compare the ADI method, Brailovskaya's
method [1965], and Graves' Partial Implicitization method [1975] on the
CDC STAR-100 for the driven cavity problem and conclude that both Brailovskaya's
method and Graves' method vectorize well and are the fastest on the STAR-100,
even though the ADI method was the fastest on a serial machine. Weilmunster
and Howser [1976] consider a boundary layer/shock interaction calculation
governed by the full Navier-Stokes equations in 2 dimensions. They report
speed-ups of as much as 65 to 1 on the STAR over a corresponding program on
a 6600 (but the 6600 program used the RUN compiler, which produces object
code that normally runs about half as fast as the FTN compiled code, so the
speed-up is more correctly on the order of 30 or 35 to 1).
Much more work is now in progress at the various laboratories which
have vector machines and we can hope to see many additional insights and
comparisons in the near future. Rather than report in detail on any particular
problems, it is the purpose of this paper to review some of the
basic ideas involved in solving partial differential equations on vector
computers. A related survey is that of Heller [1977], which concentrates on
linear algebra computations and gives more emphasis to theoretical questions
about parallel computational processes. We mention also the annotated
bibliography of Poole and Voigt [1974], which was an essentially complete
listing of works pertaining to numerical methods for vector and parallel
computers up to the time of its publication.

After  summarizing  the  basic  architectural  features  of  vector  computers 
in  Section  2,  we  consider  in  Section  3 direct  methods  for  the  solution  of 
the discrete systems of algebraic equations which result from elliptic boundary
value problems. In Section 4, we treat iterative methods for elliptic
equations  as  well  as  marching  methods  for  initial-boundary  value  problems 
for  parabolic  and  hyperbolic  equations.  Finally,  in  Section  5 we  summarize 
very  briefly  our  conclusions. 

Throughout,  we  have  tried  to  take  as  broad  a view  as  possible,  but  since 
our own experience is primarily with the CDC STAR-100, many examples and
details  are  necessarily  given  in  terms  of  this  machine. 

2.  Summary of Vector Hardware

For  the  purposes  of  this  paper,  we  will  consider  a vector  computer  as 
one  capable  of  operating  on  the  contents  of  a collection  of  memory  locations, 
known  as  a vector,  with  a single  hardware  instruction.  The  architecture  of 
a specific  computer  dictates  the  amount  of  generality  permitted  in  defining 
the  elements  of  a vector.  For  example,  as  we  will  see  below,  some  computers 
allow vector operations on memory locations which are a constant multiple
apart while others permit them only on contiguous memory locations. We
will discuss in this section the basic characteristics of the ILLIAC IV,
the CDC STAR-100, the TI-ASC, and the CRAY-1, all of which have hardware
vector operations. The latter three computers are discussed in detail in
Ramamoorthy and Li [1977]. We have specifically excluded from consideration
computers like the CDC 7600 which execute vector operations by means of
software aids such as STACKLIB (see e.g. McMahon, Sloan and Long [1972]).

ILLIAC IV. The present manifestation of the original design for the
ILLIAC IV (Barnes et al. [1968]) has 64 processing elements (PE's), each
capable of executing the same instruction at the same time (Bouknight et al.
[1972]). Each PE has 2048 64-bit words of local memory. The main memory
of the ILLIAC consists of a fixed head disk with a capacity of about 16
million words. The disk has a rotation period of 40 milliseconds and is
capable of transfer rates of about 960 million bits per second.* Consequently,
the overriding consideration in using the ILLIAC IV is to formulate the
problem and choose algorithms which permit predictable transfers of large
blocks of data to and from the disk. The data transfer problem is eased by
a PE interconnection pattern of an 8 x 8 grid that permits the movement of
data in a PE to any one of its immediate north, south, east or west neighbors.
In addition, each PE on the boundary of the grid is connected to the corresponding
PE on the opposite boundary (see Figure 1).

*Secondary  storage  on  the  STAR-100,  ASC  and  CRAY-1  consists  of  multiple 
disk  storage  units  with  similar  characteristics.  In  all  cases  the  disk 
capacity  is  on  the  order  of  10^  bits,  the  latency  is  approximately  15  msec, 
and  the  transfer  rate  is  in  excess  of  30  x 10^  bits  per  second. 
Figure 1.  PE interconnection pattern for the ILLIAC IV

For the ILLIAC IV a vector consists of up to 64 elements with each element
in the memory of a different PE. Operating on vectors of this type, the computer
is capable of over 40 million floating point operations per second
(MFLOPS). Using 32-bit mode this rate nearly doubles.

STAR-100. The CPU of the STAR-100 consists of two pipelines which are
configured via microcode to execute the appropriate instruction (Control
Data Corp. [1975]). A vector instruction initiates the flow of operands
from memory to the pipeline. Assuming that the instruction involves two
source vectors, each segment of the pipeline accepts two elements, performs
its particular function (e.g. coefficient alignment or exponent adjustment),
passes the result to the next segment, and receives the next two elements
from the stream of operands. When the result of the operation emerges from
the pipeline, it is returned directly to memory. For more details on pipeline
architecture see, for example, Stone [1975]. There is some overhead in
initiating a vector instruction but once this is overcome, the pipeline
produces a result about every minor cycle or clock. Thus a timing formula
in minor cycles for vector instructions has the form

(2.1)  $T = S + \alpha n$

where S is the overhead, frequently called the start-up time, $1/\alpha$ is the
number of results per minor cycle emerging from the pipeline, and n is the
length of the vector. For the STAR-100, S = O(100) and $\alpha$ = O(1). We note
that n should actually be replaced in (2.1) by $8\lceil n/8 \rceil$, which implies that
vectors whose length is a multiple of 8 are the most efficient to work with.
But, for simplicity, we will assume the timing formula is as given by (2.1).

To see more graphically the effect of start-up time, we give in Figure 2
a plot of the results per microsecond for 64-bit addition on the STAR-100 as
a function of the vector length n. Here, S = 71, $\alpha = 1/2$, and the minor
cycle time is .04 μsec so that, from (2.1), the number of results per microsecond
is given by 50n/(142+n), with the asymptotic limit equal to 50 MFLOPS.

[Plot: results per microsecond versus vector length n, for n from 1 to 10000.]

Figure 2.  Results per microsecond for the addition of two 64-bit
vectors of varying length on the STAR-100.
From Figure 2, it is evident that even for vectors of length 100,
the result rate is only 40% of that possible and that vectors of length
almost 10,000 are needed before the start-up overhead becomes completely
negligible. Note also that there is no sharp discontinuity in result
rates as on the ILLIAC IV where, for example, operations on vectors of length
65 would take twice as long as for vectors of length 64, since there are
only 64 PE's.
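The figures quoted above follow directly from the timing formula (2.1). As a small sketch, the Python functions below evaluate the formula with the start-up time of 71 cycles, one result every half cycle, and the 0.04 microsecond cycle time stated in the text; the function names and the `granule` parameter (used to model rounding n up to a multiple of 8) are hypothetical conveniences. The last function models the step behavior attributed to the ILLIAC IV, assuming only that each group of up to 64 elements costs one pass.

```python
import math

CYCLE_USEC = 0.04  # minor cycle time in microseconds, as quoted in the text

def vector_time_cycles(n, startup=71.0, alpha=0.5, granule=1):
    """Timing formula (2.1), T = S + alpha * n, in minor cycles.

    With granule=8 the length n is first rounded up to a multiple of 8.
    """
    n_eff = granule * math.ceil(n / granule)
    return startup + alpha * n_eff

def results_per_usec(n, **kwargs):
    """Floating point results per microsecond for a vector of length n."""
    return n / (vector_time_cycles(n, **kwargs) * CYCLE_USEC)

def illiac_passes(n, pes=64):
    """Number of passes needed when at most `pes` elements are handled at once."""
    return math.ceil(n / pes)

rate_100 = results_per_usec(100)       # equals 50*100/(142+100)
fraction_100 = rate_100 / 50.0         # share of the asymptotic 50 MFLOPS
fraction_10000 = results_per_usec(10000) / 50.0
```

Under these assumptions a vector of length 100 reaches a little over 40% of the asymptotic rate and a vector of length 10000 a little under 99%, while the pass count for the 64-element model doubles between n = 64 and n = 65.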

The memory of the STAR-100 consists of up to one million 64-bit words.
A vector is limited to contiguous memory locations. This restriction is
eased somewhat by the rich instruction set which contains many operations
for the manipulation of memory; however, as will be seen later, the cost
of these operations can become a significant percentage of the execution
time for a given algorithm.

An  interesting  feature  of  the  STAR-100  is  the  ability  to  do  floating 
point  operations  on  halfwords  of  32  bits.  This  has  the  effect  of  doubling 
the  memory  size  and  more  than  doubling  the  result  rate.  The  STAR-100  can 
achieve  over  100  MFLOPS  in  32  bit  mode. 

We note that one shortcoming of the current STAR hardware - which
will be considerably improved in the forthcoming 100A - is the rather slow
time for scalar operations. The precise timings are difficult to ascertain
since much depends on register conflicts, etc., but an average floating point
operation time seems to be approximately 1 μsec, which is roughly the
same as the CDC 6600.

ASC. The CPU of the ASC consists of one, two or four pipelines which,
as in the STAR-100, are configured for the desired operation (Texas Instruments
[1974]). A vector instruction causes the operands to move from memory,
through the pipelines and back to memory. The timing of vector instructions
is given by equation (2.1), and as with the STAR-100, S = O(100) and
$\alpha$ = O(1), although for one or two pipeline versions S may be in the range
of 25 to 50 (see Texas Instruments [1974]).

A vector  on  the  ASC  is  the  contents  of  a collection  of  memory  locations 
which can be referenced by a triply nested FORTRAN DO loop where the innermost
loop has an increment of one; however, there may be a degradation in
speed  if  the  locations  are  not  contiguous.  This  makes  possible  an  interesting 
feature  of  the  computer:  it  can  translate  an  operation  in  DO  loops  nested 
up  to  three  deep  into  a single  hardware  instruction.  For  example 

      DO 1 I = 2, 11
      DO 2 J = 1, 20, 2
      DO 3 K = 10, 1, -1
      A(I,J,K) = B(I,J,K) + C(I,J,K)
    3 CONTINUE
    2 CONTINUE
    1 CONTINUE

becomes  one  vector  add  instruction  for  vectors  of  length  1000. 

The  importance  of  this  is  that  there  is  only  one  start-up  required  in 
such  a situation.  As  we  will  show  later,  this  can  have  a pronounced  effect 
on  the  performance  of  certain  algorithms. 
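To make the saving concrete, the following sketch counts the iterations of the three loops above and compares one long vector instruction with one instruction per innermost loop, using the timing formula (2.1). The values S = 100 and alpha = 1 are hypothetical round numbers chosen only to be consistent with S = O(100) and alpha = O(1); they are not measured ASC timings.

```python
def vector_time(n, startup=100, alpha=1):
    """Timing formula (2.1) in cycles; startup and alpha are assumed values."""
    return startup + alpha * n

# Index ranges of the three DO loops (FORTRAN bounds are inclusive).
i_values = range(2, 12)        # DO 1 I = 2, 11
j_values = range(1, 21, 2)     # DO 2 J = 1, 20, 2
k_values = range(10, 0, -1)    # DO 3 K = 10, 1, -1

total_elements = len(i_values) * len(j_values) * len(k_values)

# One hardware instruction covering all three loops: a single start-up.
one_instruction = vector_time(total_elements)

# One vector instruction per innermost loop: a start-up for every (I, J) pair.
per_inner_loop = len(i_values) * len(j_values) * vector_time(len(k_values))
```

With these assumed constants the collapsed form costs 1100 cycles against 11000 cycles for the loop-by-loop form, so the start-up time is paid once and not 100 times.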

Memory on the ASC is available with up to 16 million 32-bit words. A
four pipeline version of the computer is capable of performance in the range
of 50 MFLOPS. As with the STAR-100, scalar arithmetic is relatively slow.
The average time for a 32-bit floating point operation on a one pipeline
computer appears to be about 0.5 μsec. This can decrease by as much as a
factor of four with the addition of more pipelines and assuming an instruction
mix that can be executed in parallel.
CRAY-1. The CPU of the CRAY-1 contains twelve functional units which
are segmented like the pipelines of the ASC and STAR-100 (Cray Research,
Inc. [1976]). Unlike these pipelines, however, the functional units are
not reconfigurable; each functional unit executes a subset of the instruction
set. For performing vector operations the CRAY-1 has eight 64-word
vector registers. A vector is defined as the contents of consecutive elements
of one of the vector registers, always beginning with the first element
of that register. The vector registers are loaded from memory, which
consists of up to one million 64-bit words, and their contents returned to
memory by vector load and store operations. Thus, as on the ILLIAC IV, a
vector on the CRAY-1 is less than or equal to 64 words with operations on
longer strings being done in increments of 64. Unlike the ILLIAC IV, there
may be little penalty for processing a vector of length 65 versus one of
length 64. If a register is available, the 65th element is loaded and the operation
begins in the next cycle after the operation on the 64th element began.
There may, however, be conflicts which degrade the performance from this
ideal situation.
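The processing of long vectors in increments of 64 can be pictured with the following sketch, which splits a long operation into register-sized segments and applies an elementwise addition to each. It models only the segmentation described above; the function names are hypothetical, and no attempt is made to model timing, chaining or register conflicts.

```python
def strips(n, register_length=64):
    """Split a vector of length n into (start, length) segments of at most 64."""
    return [(start, min(register_length, n - start))
            for start in range(0, n, register_length)]

def add_long_vectors(b, c, register_length=64):
    """Elementwise sum of two equal-length lists, one register-sized strip at a time."""
    a = [0.0] * len(b)
    for start, length in strips(len(b), register_length):
        reg_b = b[start:start + length]                 # "vector load" of one strip
        reg_c = c[start:start + length]
        reg_a = [x + y for x, y in zip(reg_b, reg_c)]   # one vector add
        a[start:start + length] = reg_a                 # "vector store"
    return a

segments_65 = strips(65)
total = add_long_vectors(list(range(130)), [1] * 130)
```

A vector of length 65 yields one full segment and one segment of length 1, which is the case discussed in the text.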

Two interesting features of the functional units are that they may
operate independently and in parallel, and that they may be "chained."
The latter feature means that the results of one functional unit may be
used as input to another immediately without returning to the vector registers.
These are particularly useful and powerful capabilities, making it
possible for the CRAY-1 to achieve in excess of 160 MFLOPS.

An outstanding feature of the CRAY-1 relative to the other vector computers
is its fast scalar arithmetic. With a cycle time of 12.5 ns, the
CRAY-1  is  more  than  twice  as  fast  on  scalar  operations  as  the  CDC  7600 
(see  Keller  [1976]). 
3.  Direct Methods for Elliptic Boundary Value Problems


Most of the work that has been done to date on the solution of elliptic
boundary value problems on vector computers has centered around the solution
of the algebraic system that results from discretizing the partial
differential equation. Our discussion will reflect this emphasis, and include
both direct and iterative methods for solving the algebraic systems.

If  the  discretization  is  by  finite  difference  methods,  then  there  is 
little  computational  work  involved  in  generating  the  algebraic  system; 
however,  if  finite  element  methods  are  used,  there  may  be  considerable 
effort  required,  and  it  is  not  clear  that  this  computation  is  well  suited 
for vector computers. Essentially the only paper that has addressed this
point is Noor and Fulton [1975], who consider both the generation and
the solution on the STAR-100 of a finite element structural analysis
system. A procedure for generating the element stiffness matrix that
utilizes vector operations is given; a timing estimate of this process indicates
a speedup of at least a factor of 5 over the CDC 6600. The algorithm
for factoring the linear system is similar to the banded ones described
below.

In  this  section  we  will  first  consider  algorithms  for  factoring  the 
coefficient  matrix,  assuming  that  some  natural  ordering  of  the  grid  points 
is  used.  Next,  we  treat  different  orderings,  especially  those  of  one-way 
and  nested  dissection.  Finally,  we  consider  special  methods  for  tridiagonal 
systems,  and  fast  Poisson  solvers  based  on  the  fast  Fourier  transform. 

Factorization Methods. We make no specific assumptions about the
differential equation, the domain or the discretization but assume only that
the discrete equations

(3.1)  $Ax = b$

are such that the matrix A is symmetric, positive definite and banded,
with bandwidth $\beta(A)$ defined by

$\beta(A) \equiv \max \{ |i-j| : a_{ij} \neq 0 \}$.

We note that although A is banded it would be inefficient in many situations
not to take advantage of its profile structure (see e.g. Jennings
[1966]) but we shall not discuss this here. We consider the Cholesky decomposition
$A = LL^T$, where L is lower triangular and no attempt is made
to exploit the zeros within the band since the band itself almost completely
fills during the factorization. The solution of (3.1) is then obtained
by solving the systems $Ly = b$ and $L^T x = y$. Algorithms are discussed
for three computers: the STAR-100, the ASC, and the CRAY-1.
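For reference, the three steps (factor A, solve Ly = b, solve L^T x = y) can be sketched in a few lines of scalar Python. This is a plain serial illustration of the band Cholesky method, not one of the vector algorithms discussed below; for clarity the matrix is held as a dense list of lists although only entries with |i-j| no larger than the bandwidth are touched, and the function names are hypothetical.

```python
import math

def band_cholesky(A, beta):
    """Return lower triangular L with A = L L^T, using only entries with |i-j| <= beta."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        lo = max(0, j - beta)
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(lo, j))
        L[j][j] = math.sqrt(d)          # d > 0 for positive definite A
        for i in range(j + 1, min(n, j + beta + 1)):
            lo_i = max(0, i - beta)
            s = sum(L[i][k] * L[j][k] for k in range(lo_i, j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_spd_banded(A, b, beta):
    """Solve Ax = b via A = L L^T, then Ly = b and L^T x = y."""
    n = len(A)
    L = band_cholesky(A, beta)
    y = [0.0] * n
    for i in range(n):                  # forward substitution
        s = sum(L[i][k] * y[k] for k in range(max(0, i - beta), i))
        y[i] = (b[i] - s) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):        # back substitution with L^T
        s = sum(L[k][i] * x[k] for k in range(i + 1, min(n, i + beta + 1)))
        x[i] = (y[i] - s) / L[i][i]
    return x

# Small test problem: the tridiagonal matrix with stencil (-1, 2, -1), bandwidth 1.
n = 5
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
x_true = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [sum(A[i][j] * x_true[j] for j in range(n)) for i in range(n)]
x = solve_spd_banded(A, b, beta=1)
```

The inner sums over k are the natural candidates for vector operations, and their short length when the bandwidth is small is the source of the start-up penalties analyzed in the remainder of this section.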

In Lambiotte [1975] several algorithms for performing the factorization
of A were analyzed for a virtual computer patterned after the
STAR-100. The following algorithm is a variant of one which was shown
to be among the fastest. The lower half of the matrix A is assumed to
be stored by columns, and the factor L and the modifications to A are
developed a column at a time using the appropriate linear combinations
of the columns of A. For example, the following vector operations are involved
in the modification of column i + j of A required when computing
column i of L:

(3.2)  $(a_{i+j,i+j}, \ldots, a_{i+\beta,i+j}) \leftarrow (a_{i+j,i+j}, \ldots, a_{i+\beta,i+j}) - l_{i+j,i} * (a_{i+j,i}, \ldots, a_{i+\beta,i})$
(The algorithm is not implemented in precisely the form given in (3.2); for
details the reader is referred to Knight, Poole, and Voigt [1975] and George,
Poole, and Voigt [1976b].) Expression (3.2) must be executed for j ranging
from 1 to $\beta$; hence, the vector lengths for the algorithm vary from 1 to
$\beta$ with an average of $(\beta+1)/2$. In George, Poole, and Voigt [1976b] a precise
timing formula is given which, omitting scalar arithmetic times, is approximately

(3.3)  $T(N,\beta) \sim [\ldots] + 232N\beta + \text{lower order terms}$

STAR-100 cycles. Note that the large coefficient of the $N\beta$ term causes
that term to dominate the timing for all but very large problems. Most of the
cycles of that coefficient are attributable to vector start-up times; as
will be seen, this is in sharp contrast with the ASC where, by choosing a
different algorithm, we will be able to reduce the influence of the start-up
times to lower order terms. Note that even for large problems, if the
bandwidth is small the algorithm will be very inefficient for it requires
a large number of vector operations of short lengths. This effect reaches
its extreme in solving tridiagonal systems, a topic we will discuss later.
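The effect of short vectors can be seen in a small sketch that lists the lengths of the column updates for one step of the factorization and estimates what share of the time goes to start-up under the timing formula (2.1). The bandwidth of 20 and the constants S = 100 and alpha = 1 are hypothetical values chosen for illustration; they are not the measured STAR-100 figures behind the timing formula in the text.

```python
def column_update_lengths(beta):
    """Lengths of the column update vectors for j = 1, ..., beta.

    The update of column i+j touches rows i+j through i+beta,
    that is, beta - j + 1 elements.
    """
    return [beta - j + 1 for j in range(1, beta + 1)]

def startup_share(lengths, startup=100, alpha=1):
    """Fraction of total cycles spent on start-up when each vector costs S + alpha*n."""
    total_startup = startup * len(lengths)
    total_work = alpha * sum(lengths)
    return total_startup / (total_startup + total_work)

beta = 20
lengths = column_update_lengths(beta)
average = sum(lengths) / len(lengths)   # equals (beta + 1) / 2
share = startup_share(lengths)
```

With these assumed numbers the average vector length is 10.5 and roughly nine tenths of the cycles are start-up overhead, which is the behavior the large coefficient of the lower order term reflects.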

In Calahan, Joy, and Orbits [1976] it was noted that the correct implementation
on the ASC of an LU factorization of a full n x n matrix yields a
timing formula where the start-up times contribute to the O(n) term rather than
the $n^2$ term as in the STAR-100. In Voigt [1977] that same improvement is
obtained in the banded case by using an algorithm whose timing has the start-up
time contributing to the N term rather than the $N\beta$ term as with the STAR-100.
In essence we must store an additional band of zeros, nearly doubling the storage
required by the above algorithm, and the operations have constant length $\beta^2$,
some of which are with zeros, rather than average length $(\beta+1)/2$. Assuming row
by row storage, the key operations at the k-th step of the factorization are
given by:
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DO  10  J=K+1,  K+8 


r 

DO  20  I = K - 6,  K - 1 

I 

TEMP  = TEMP  + A(K, I)  * A(J,I) 

20  CONTINUE 
10  CONTINUE 

This nested DO loop can be executed on the ASC in one inner product
instruction on vectors of length $\beta^2$ rather than $2\beta$ instructions of
average length $(\beta+1)/2$ on the STAR-100. The timing formula for the
factorization is then approximately

(3.4)  $T(N,\beta) \approx N\beta^2 + [\text{coefficient illegible}]\,N\beta + 485 N + \text{lower order terms}$

ASC  cycles.  This  formula  assumes  no  degradation  in  performance  for  non-contiguous 
vector  elements. 

The usefulness of this algorithm clearly depends on N and $\beta$ since we have
doubled the coefficient of the $N\beta^2$ term in order to remove the start-up
time influence to the O(N) term. This tradeoff is discussed in more detail in
Voigt [1977]. That paper also includes a discussion of the influence of
individual operation times on the two machines, including the impact of a slow
inner product on the STAR-100 versus a fast one for the ASC.

We  now  turn  our  attention  to  the  CRAY  where  a different  architectural 
characteristic is dominant. As was mentioned in Section 2, all vector
instructions obtain their operands from the vector registers; thus these
registers must be loaded from memory and this can have a major impact on the
effective computational rate as has been noted by Calahan, Joy and Orbits
[1976] and Fong and Jordan [1976]. To be specific, there is one path between
memory and the vector registers capable of transferring one operand or result
per clock. With the functional units capable of performing an add and multiply
every clock, the potential problem is clear. For example, if the STAR-100
algorithm given in expression (3.2) were implemented, it is shown in
Voigt [1977] that the arithmetic units would be busy only half of the
time because a load and a store are required for each multiply-subtract
combination.

By redesigning the algorithm it is possible to keep both arithmetic units
busy. Following the spirit of algorithms presented in Calahan, Joy and
Orbits [1976] and Fong and Jordan [1976], this is accomplished by completely
modifying the (k+1)st column of the matrix at the kth step of the
factorization and leaving all other columns unchanged. The key portion of the
algorithm is given below where $j = 1,\ldots,\beta$:

(3.5)  $(a_{k+1,k+1},\ldots,a_{k+j,k+1}) \leftarrow (a_{k+1,k+1},\ldots,a_{k+j,k+1}) - a_{k-\beta+j,k+1} * (a_{k+1,k-\beta+j},\ldots,a_{k+j,k-\beta+j})$

Further details may be found in Voigt [1977] but note that the (k+1)st
column need not be stored until the loop on j is completed. Thus the
arithmetic units can be kept busy essentially throughout the factorization.

In  Calahan,  Joy  and  Orbits  [1976]  algorithms  similar  to  those  given  in 
expressions  (3.2)  and  (3.5)  for  factoring  a dense  matrix  are  compared  with  runs 
on  a CRAY-1  computer  and  the  times  reported  there  bear  out  the  contention  that 
algorithm  (3.5)  is  twice  as  fast  as  algorithm  (3.2).  In  Orbits  and  Calahan 
[1976]  a block  method  is  discussed  for  factoring  a dense  system  which  requires 
about  of  the  memory-register  transfers  that  are  required  of  the  best  algorithm 
discussed  above.  The  algorithm  is  very  complicated  to  implement  on  the  CRAY-1 
and  will  not  be  discussed  here  since  we  have  already  considered  one  which  is 
limited by arithmetic unit speed rather than memory-register transfer speed.
This block algorithm, as well as its usefulness in the banded case, also is
discussed in Fong and Jordan [1976].

The  previous  discussion  applies  to  arbitrary  banded  systems  of  equations 
and,  in  particular,  to  the  discrete  forms  of  elliptic  equations  in  which  the 
ordering  of  the  node  points  is  such  as  to  give  rise  to  a suitably  banded 
system.  We  now  consider  other  node  ordering  schemes  that  have  been  shown 
to  reduce  the  number  of  arithmetic  operations  required  for  factoring 
the  matrix  of  the  resulting  linear  system.  The  first  of  these  is  known 
as  one-way  dissection  and  is  discussed  in  detail  in  George  [1972,  1977]. 
Referring to Figure 3, the idea is first to divide the grid with $\ell$
horizontal separators. The nodes in the $\ell+1$ remaining rectangles are
numbered vertically toward a separator and then the separators are numbered.

Figure 3. An n by n mesh dissected into $\ell$ blocks with
the ordering indicated by the circled numbers
and the arrows.

For the proper choice of $\ell$ this ordering has been shown to reduce the
number of arithmetic operations required for the factorization from $O(n^4)$
for the natural ordering to $O(n^{7/2})$ (see George [1972]).

The nested dissection ordering further reduces the operation count
to $O(n^3)$ as shown in George [1973, 1977]. The idea here is to divide
the grid with both horizontal and vertical separators as shown in Figure 4.

Figure 4. One step of the nested dissection
ordering for the n by n grid.

Now regions 1-4 may be numbered in the usual way or, since they are
again squares, they may be numbered using horizontal and vertical separators.
Clearly the idea may be applied recursively, and in the case
$n = 2^k - 1$, dissection will terminate after k-1 steps. In order to obtain
the $O(n^3)$ operation count, dissection must be carried to completion;
however, as noted in George, Poole, and Voigt [1976a], there are advantages to
stopping the dissection early.
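The step count quoted above can be checked with a small Python sketch. It assumes that each step places one separator line through the middle of a line of n nodes, leaving two halves of (n-1)/2 nodes, and that dissection is complete when a sub-grid has shrunk to a single node. The function name `dissection_steps` is illustrative only.

```python
def dissection_steps(n):
    """Count dissection steps for an n by n grid with n = 2**k - 1.

    Assumption: each step removes one separator line and leaves
    sub-grids with (n - 1) // 2 nodes per side; the recursion stops
    when a sub-grid has a single node per side.
    """
    if (n + 1) & n:
        raise ValueError("n must have the form 2**k - 1")
    steps = 0
    while n > 1:
        n = (n - 1) // 2
        steps += 1
    return steps
```

For n = 7 (k = 3) the sides shrink 7, 3, 1, which is k-1 = 2 steps under these assumptions.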

Nested  dissection  for  vector  computers  was  first  discussed  by  Calahan 
[1975],  who  considered  rather  general  rectangular  finite  elements.  The 
paper  includes  estimates  of  the  number  of  vector  operations  required  for  the 
factorization  and  their  average  lengths  assuming  dissection  is  carried  to 
completion. Based on this information it is concluded that nested dissection
would be attractive on a vector computer such as the STAR-100 or
the  ASC.  The  idea  of  stopping  the  dissection  early  is  not  considered. 

The  appropriate  level  of  dissection  becomes  an  interesting  question 
for  a vector  computer.  We  have  already  seen  that,  because  of  start-up  times, 
it is desirable to work with vectors whose length is as great as possible;
however, it should be clear from Figures 3 and 4 that as dissection continues
at least some of the vectors get shorter. This phenomenon is
studied  in  detail  in  George,  Poole,  and  Voigt  [1976b],  and  there  are  two 
results  worth  mentioning.  First,  as  one  would  expect,  efficient  use  of 
vector  computers  dictates  less  dissection  than  implied  by  the  serial 
operation  count.  For  example,  on  computers  for  which  the  ratio  of  start-up 
time  to  result  time  is  similar  to  that  of  the  STAR-100  or  the  ASC,  the  minimum 
time  for  a given  factorization  is  obtained  by  stopping  nested  dissection 
two  levels  from  completion. 

Secondly, both the one-way and nested dissection algorithms translate
almost entirely into vector operations; however, in spite of a lower
operation count, one-way dissection actually introduces more vector operations
than are present in the natural ordering and this results in the
natural ordering being superior for all but very large n. For example,
on the STAR-100 the crossover point does not occur until n is approximately
650 whereas in the scalar case one-way dissection is superior for
n > 30. In contrast, nested dissection not only reduces the operation count
but it can also be implemented with fewer vector operations than the natural
ordering. Further details on this comparison are also given in Voigt [1977].

In principle, both dissection algorithms would be attractive on the
CRAY-1 and the ILLIAC IV since neither computer is burdened with vector
start-ups. However, both computers are limited by their I/O capabilities and
we have already seen how this can limit the effectiveness of the CRAY-1.
As was shown in George [1972, 1973], one-way and nested dissection reduce
the storage requirements from $O(n^3)$ locations for the natural ordering
to $O(n^{5/2})$ and $O(n^2 \log n)$ respectively. Thus it would appear that
there would be less I/O demanded by the orderings. However, to date, there has
been no careful analysis of the I/O required by an implementation of
either of the orderings on a computer such as the CRAY-1 or the ILLIAC IV.


One difficulty with the dissection orderings that should not be minimized
is that they are at best difficult to apply to nonrectangular domains.
Although it is usually clear how to proceed, the implementation of
an automatic procedure for performing the ordering is a difficult problem
in computer software. A step in this direction has been taken by
George and Liu [...] for conventional serial computers, but the techniques
they employ do not seem to be very well suited for vector computers. In
principle an irregular domain causes no great difficulties in implementing
the natural ordering; however, because of the possible irregular band
structure, it may be desirable to use the so-called profile method (see, e.g.
Jennings [1966]) which takes advantage of the leading zeros in each
individual row (or column).

Another  factorization  technique,  called  the  hypermatrix  scheme,  was 
developed by von Fuchs, et al. [1972], who were motivated by the finite element
analysis of large, complex structures which produce global structural
matrices  that  do  not  fit  in  the  main  storage  of  a computer.  The  actual 
factorization  is  simply  block  decomposition;  the  unique  feature  of  the 
scheme  is  that  a hierarchy  of  bit  matrices  is  used  to  manage  the  nonzero 
entries  of  the  matrix.  At  the  first  level  a bit  matrix  is  used  to  identify 
the blocks which have nonzero entries and thus must be used in the computation.
For large problems it may be inconvenient or impossible to keep
even  the  bit  matrix  in  memory;  hence,  a second  bit  matrix  is  introduced 
whose  entries  indicate  blocks  of  the  first  bit  matrix  that  have  any  "ones" 
present.  This  may  be  continued  to  any  desired  depth.  In  Noor  and  Voigt 
[1975] this approach is analyzed for the STAR-100 and a vectorized block
factorization algorithm is given. It is shown that the ability to select
the block size can be beneficial not only to the structural analyst, but
also  in  maximizing  the  efficiency  of  the  computer.  For  example,  the  block 
size  could  be  chosen  so  that  the  time  required  for  the  computations  on  the 
blocks  overlapped  the  time  required  to  bring  a new  block  in  from,  and  write 
an  old  block  out  to,  secondary  storage.  This  balance  of  I/O  and  computation 
is  known  to  be  a significant  problem  for  vector  computers  (see,  for  example, 
Knight,  Poole,  and  Voigt  [1975]  and  Orbits  and  Calahan  [1976]). 
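The hierarchy of bit matrices described above can be illustrated with a short Python sketch. It is not the data structure of von Fuchs et al. or of Noor and Voigt; the function name `block_bit_matrix`, the dense list-of-lists storage, and the square blocking are assumptions chosen to keep the example small.

```python
def block_bit_matrix(matrix, block):
    """Return a bit matrix whose (I, J) entry is 1 when the block-by-block
    tile (I, J) of a square `matrix` contains at least one nonzero."""
    nb = (len(matrix) + block - 1) // block   # number of block rows/columns
    bits = [[0] * nb for _ in range(nb)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:
                bits[i // block][j // block] = 1
    return bits


# A 4 x 4 matrix with nonzeros only in the two diagonal 2 x 2 blocks.
A = [[4, 1, 0, 0],
     [1, 4, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]]
level1 = block_bit_matrix(A, 2)        # which blocks of A are nonzero
level2 = block_bit_matrix(level1, 2)   # which groups of level-1 bits hold a one
```

Applying the same function to its own output gives the second-level bit matrix, and the process can be repeated to any depth.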

We have mentioned above that the average vector lengths occurring in
the algorithms for factorization of a matrix with bandwidth $\beta$ are
$O(\beta)$. The tacit assumption has been that $\beta$ is sufficiently large so
that these operations are reasonably efficient; however, that is not always the
case. For example, in iterative methods such as ADI or SLOR one must solve a
large number of tridiagonal systems, that is, systems with $\beta = 1$. Since
the previous algorithms are totally inappropriate for this problem we will
now discuss some different algorithms for the factorization of tridiagonal
matrices.

We drop our assumption of the symmetry of A and consider LU factorizations
where L is unit lower triangular and U is upper triangular.
The usual Gauss elimination algorithm for tridiagonal matrices is inherently
serial. For example, if the ith row of A is $(0,\ldots,0,c_i,a_i,b_i,0,\ldots,0)$,
the ith element of the diagonal of U is given by the recursion formula

$u_i = a_i - c_i b_{i-1}/u_{i-1}$ .

Since $u_i$ depends on $u_{i-1}$, this cannot be directly computed using vector
operations.
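The serial dependence is easy to see in a minimal Python sketch of the recursion. The function name `tridiagonal_lu` and the storage of the three diagonals as lists are assumptions made for illustration; no pivoting is attempted, so a zero pivot is not handled.

```python
def tridiagonal_lu(c, a, b):
    """LU factors of a tridiagonal matrix whose row i is (..., c[i], a[i], b[i], ...).

    Returns (l, u): l[i] is the subdiagonal entry of the unit lower
    triangular factor (l[0] is unused) and u[i] is the diagonal of U.
    """
    n = len(a)
    l = [0.0] * n
    u = [0.0] * n
    u[0] = a[0]
    for i in range(1, n):
        # Each u[i] needs u[i-1], so the loop cannot be run as one vector operation.
        l[i] = c[i] / u[i - 1]
        u[i] = a[i] - l[i] * b[i - 1]
    return l, u


l, u = tridiagonal_lu([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0])
```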

In Stone [1973a] it was noted that the recurrences of Gauss elimination
could be reformulated into products of two by two matrices and that these
products could be evaluated with vector operations using a technique known
as recursive doubling. Recursive doubling in the simplest case expresses
the 2ith component of the product in terms of the ith; thus for $n = 2^k$
the nth component can be computed in $\log_2 n = k$ steps.
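A rough Python sketch of the idea follows. It is one way to realize the reformulation, not necessarily Stone's exact arrangement: it assumes the substitution $u_i = p_i/p_{i-1}$, which turns the recursion into $p_i = a_i p_{i-1} - c_i b_{i-1} p_{i-2}$, a product of two by two matrices. The names `matmul2`, `prefix_products`, and `u_diagonal_by_doubling` are hypothetical, and the unscaled $p_i$ can grow or shrink without bound for large n.

```python
def matmul2(x, y):
    """Product of two 2x2 matrices stored as nested tuples."""
    return (
        (x[0][0] * y[0][0] + x[0][1] * y[1][0], x[0][0] * y[0][1] + x[0][1] * y[1][1]),
        (x[1][0] * y[0][0] + x[1][1] * y[1][0], x[1][0] * y[0][1] + x[1][1] * y[1][1]),
    )


def prefix_products(mats):
    """All products M_i * ... * M_0 by recursive doubling.

    After the pass with a given stride, entry i holds the product of
    up to 2 * stride consecutive matrices ending at i.
    """
    p = list(mats)
    stride = 1
    while stride < len(p):
        # One pass: every entry is updated independently of the others,
        # which is what makes the pass a vector operation.
        p = [matmul2(p[i], p[i - stride]) if i >= stride else p[i]
             for i in range(len(p))]
        stride *= 2
    return p


def u_diagonal_by_doubling(c, a, b):
    """Diagonal of U for the tridiagonal matrix with rows (c[i], a[i], b[i])."""
    mats = [((a[0], 0.0), (1.0, 0.0))]
    mats += [((a[i], -c[i] * b[i - 1]), (1.0, 0.0)) for i in range(1, len(a))]
    # The first column of the i-th prefix product is (p_i, p_{i-1}).
    return [m[0][0] / m[1][0] for m in prefix_products(mats)]


u = u_diagonal_by_doubling([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0])
```

Each pass touches all n entries and there are about $\log_2 n$ passes, so the total arithmetic exceeds that of the serial recursion.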

As  discussed  in  Lambiotte  and  Voigt  [1975],  this  method  is  suitable  for 
a computer  such  as  ILLIAC  IV  but  not  for  all  other  vector  computers  because 
for them each arithmetic operation costs some unit of time, and Stone's
algorithm requires $O(n \log_2 n)$ arithmetic operations as opposed to O(n)
for the usual scalar algorithm. Thus, in the terminology of Lambiotte and
Voigt  [1975],  Stone's  algorithm  is  inconsistent. 

In  Stone  [1975]  and  Lambiotte  and  Voigt  [1975],  various  methods  were 
analyzed,  the  latter  concentrating  on  implementations  for  the  STAR-100.  The 
conclusion  of  both  of  these  studies  was  that  cyclic  reduction  was  the  best 
method  for  vector  computers  for  matrices  of  sufficient  size,  say  n > 150. 

Cyclic  reduction  was  originally  proposed  by  G.  Golub  and  R.  Hockney  and  is 
discussed  in  Hockney  [1965]  for  tridiagonal  systems  arising  from  the  five-point 
star  for  Poisson's  equation.  Subsequently,  several  authors  including  Hockney  [1970] 
and  Ericksen  [1972]  pointed  out  that  the  algorithm  could  also  be  adapted  to 
general tridiagonal systems. The idea is to eliminate the odd numbered
variables in the even numbered equations by performing elementary row
operations. The details may be found in Lambiotte and Voigt [1975]; the
important points are that the operations may be performed on vectors defined by
the diagonals of the matrix, and that the resulting system is again tridiagonal
but only half as large. The process may be continued until, in the case that
$n = 2^k - 1$, only one equation remains; then all of the unknowns are
recovered in a back substitution process. Lambiotte and Voigt [1975] show that
this process requires only O(n) operations and is thus consistent, but
implementation of the algorithm on vector computers can suffer from data
rearrangement overhead that is as costly as the arithmetic operations. For
example, on the STAR-100 one cannot operate with every other element of a
vector as on the ASC (see Boris [1976]); thus extra operations must be employed
to reformat those elements into a new vector. By considering the appropriate
timing formulas it is easy to see that this overhead accounts for approximately
half of the total operations. The overhead, as well as start-up
time, causes the algorithm to run slower on the STAR-100 than a carefully
coded  scalar  version  for  small  values  of  n.  This  is  reflected  in  Table  1 
which  also  includes  times  on  a CDC  CYBER-175*  using  the  FTN  compiler. 


   n      STAR-100             STAR-100             CDC CYBER-175
          Gauss Elimination    Cyclic Reduction     Gauss Elimination
          assembly language    assembly language    FTN (opt. level 1)

   50        560 μsec            1039 μsec             380 μsec
  100       1075                 1337                  720
  150       1590                 1612                 1100
  200       2160                 1732                 1400
  300       3141                 2126                 2100
  500       5204                 2527                 3400
 1000      11070                 3690                 6900

Table 1.
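For readers who want to experiment with cyclic reduction, here is a minimal recursive sketch in Python. It is not the STAR-100 implementation timed in Table 1: the function name `cyclic_reduction`, the list storage, and the absence of pivoting are assumptions, and the list comprehensions stand in for the vector operations on the diagonals.

```python
def cyclic_reduction(c, a, b, d):
    """Solve c[i]*x[i-1] + a[i]*x[i] + b[i]*x[i+1] = d[i] for n = 2**k - 1.

    Assumes c[0] == 0 and b[-1] == 0, and that no pivot a[i] becomes zero.
    """
    n = len(a)
    if (n + 1) & n:
        raise ValueError("n must have the form 2**k - 1")
    if n == 1:
        return [d[0] / a[0]]
    # 0-based odd rows are the "even numbered equations" of the text.
    keep = range(1, n, 2)
    alpha = [c[i] / a[i - 1] for i in keep]
    gamma = [b[i] / a[i + 1] for i in keep]
    c2 = [-al * c[i - 1] for al, i in zip(alpha, keep)]
    a2 = [a[i] - al * b[i - 1] - ga * c[i + 1]
          for al, ga, i in zip(alpha, gamma, keep)]
    b2 = [-ga * b[i + 1] for ga, i in zip(gamma, keep)]
    d2 = [d[i] - al * d[i - 1] - ga * d[i + 1]
          for al, ga, i in zip(alpha, gamma, keep)]
    y = cyclic_reduction(c2, a2, b2, d2)   # half-size tridiagonal system
    x = [0.0] * n
    for idx, i in enumerate(keep):
        x[i] = y[idx]
    # Back substitution for the eliminated unknowns.
    for i in range(0, n, 2):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        x[i] = (d[i] - c[i] * left - b[i] * right) / a[i]
    return x


n = 7
c = [0.0] + [-1.0] * (n - 1)
a = [2.0] * n
b = [-1.0] * (n - 1) + [0.0]
d = [0.0] * (n - 1) + [8.0]     # right-hand side for x = (1, 2, ..., 7)
x = cyclic_reduction(c, a, b, d)
```

The gathering of every other element in `keep` is the data rearrangement that the text identifies as overhead on the STAR-100.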


Another  comparison  is  given  in  Madsen  and  Rodrigue  [1976]  where  a 
CDC-7600  is  compared  with  the  STAR-100.  The  authors  discuss  a polyalgorithm 
for  the  STAR-100  in  which  cyclic  reduction  is  used  until  the  matrix  is  reduced 
in size to the point where it is more efficient to solve by Gaussian
elimination. Using this polyalgorithm they show that the STAR-100 is superior
to the 7600 only after n > 750. Their work also includes a comparison with
an inconsistent algorithm proposed by Jordan [1974]; as expected it is
slower than the Gaussian elimination-cyclic reduction polyalgorithm.

* The functional units of a CYBER 175 are essentially the same speed as those
of a CDC 7600 but the memory is slower.

Recently  Swarztrauber  [1976]  has  proposed  a consistent  tridiagonal  scheme 
based  on  Cramer's  rule.  The  operation  count  is  slightly  higher  than  that 
of  cyclic-reduction  but,  as  opposed  to  cyclic  reduction,  the  algorithm  is 
well  defined  mathematically  for  general  non-singular  tridiagonal  matrices. 
Thus it might have merit in those situations where pivoting would be required
for the stability of cyclic reduction. However, a stability analysis
of Swarztrauber's algorithm has yet to be done.

A stable tridiagonal solver based on Givens transformations has been
given by Sameh and Kuck [1976] for a theoretical parallel computer. The
algorithm is inconsistent but slightly more efficient than the one given
by Stone [1973a]. The stability analysis, which is contained in Sameh and
Kuck [1977], is based on the recurrence relations that arise in the algorithm
and the same approach may be used to show that Stone's algorithm is, in
general, unstable (see Dubois and Rodrigue [1976]). Thus, for the ILLIAC IV
or in the situation where the stability of cyclic reduction cannot be
guaranteed, the algorithm based on Givens transformations may have merit.

Because  of  their  inherent  parallelism,  iterative  methods  have  been 
considered  by  Traub  [1974]  for  solving  tridiagonal  problems,  and  further 
studied  by  Lambiotte  and  Voigt  [1975]  and  Heller,  Stevenson,  and  Traub 
[1976]. Except for certain specialized situations where an excellent starting
value need only be improved by a few digits, these methods do not
appear  to  be  competitive  with  direct  methods  such  as  cyclic  reduction. 

There  are,  of  course,  matrices  of  interest  whose  bandwidth  is  too  small 
for efficient use of banded solvers on vector computers, but is more than
one. There appear to be no good algorithms available for this problem.
Rodrigue, Madsen, and Karush [1976] have proposed an extension of cyclic
reduction to matrices of arbitrary bandwidth where the vector operations
are on the diagonals as in cyclic reduction for tridiagonal matrices.
Unfortunately, the numerical stability of the algorithm remains in doubt;
indeed,  even  reasonable  conditions  on  the  banded  matrix  that  guarantee 
that  the  algorithm  remains  well  defined  (i.e.  division  by  zero  cannot  occur) 
have  not  yet  been  given. 

We  now  turn  our  attention  briefly  to  fast  Poisson  solvers  based  on  the 
fast  Fourier  transform  of  Cooley  and  Tukey  [1965].  Buzbee  [1973]  shows 
that  for  the  five-point  difference  approximation  to  Poisson’s  equation  on 
an M by N rectangular grid (M < N), a technique known as matrix decomposition
reduces the block tridiagonal system to M independent tridiagonal
systems.  The  reduction  to  tridiagonal  systems  is  accomplished  via  the  fast 
Fourier  transform  and  the  systems  may  be  solved  in  parallel  on  any  of  the 
computers  discussed  here. 

As  noted  by  Pease  [1968],  the  fast  Fourier  transform  may  be  computed 
efficiently  on  parallel  computers.  His  algorithm  was  developed  for  an  array 
computer  with  an  interconnection  pattern  known  as  the  perfect  shuffle  (see 
Stone  [1971]),  but  Korn  and  Lambiotte  [1977]  have  shown  that  it  is  also 
effective  on  the  STAR-100.  Ackins  [1968]  and  Stevens  [1971]  developed  a 
fast  Fourier  transform  for  the  ILLIAC  IV  following  the  Cooley-Tukey  approach 
and  a similar  algorithm  is  compared  with  the  Pease  algorithm  on  the  STAR-100 
in  Korn  and  Lambiotte  [1977]. 

Finally,  Sameh,  Chen,  and  Kuck  [1976]  treat  Poisson's  equation  in  a 
similar  fashion  to  Buzbee  [1973]  on  a theoretical  array  computer  with  a large 
number  of  independent  processors.  They  also  discuss  the  biharmonic  equation, 
$\nabla^4 u(x,y) = F(x,y)$, and show that techniques used in the Poisson solver
result in attractive algorithms for this equation.

4.  Iterative and Time Marching Methods


We  turn  now  to  iterative  methods  for  elliptic  equations  as  well  as 
methods  for  parabolic  and  hyperbolic  equations.  The  reason  for  treating 
these  together  is  that  many  of  the  considerations  for  both  iterative  as 
well  as  time-marching  methods  on  vector  computers  are  very  similar,  if 
not identical. Explicit methods will tend to be relatively more attractive
than on serial computers because of their usually better vectorization
properties, but this will not necessarily overcome the stringent
stability  requirements  of  small  time  steps.  We  shall  illustrate  this 
shortly. 

The  question  of  implicit  versus  explicit  methods,  however,  is  only 
one  part  of  the  broader  consideration  of  how  much  of  the  method  can  be 
implemented  by  operations  on  vectors  of  a "good"  length.  Other  aspects 
which  affect  this  will  include  the  domain,  the  boundary  conditions  (and 
perhaps  computational  boundary  conditions  needed  for  hyperbolic  equations 
and/or  higher  order  methods),  the  form  of  the  coefficients  and  whether 
their  calculation  can  be  vectorized,  the  number  of  space  dimensions,  etc. 

Parabolic  and  Hyperbolic  Equations  in  One  Dimension 

We  will  begin  the  discussion  with  the  simple  parabolic  equation 

(4.1)  $u_t = a\,u_{xx}$ ,  t > 0 ,  0 < x < 1

with constant coefficient a and initial-boundary conditions

(4.2)  $u(0,x) = g(x)$

(4.3)  $u(t,0) = \alpha$ ,  $u(t,1) = \beta$

for constant $\alpha$ and $\beta$.

Consider first the standard second-order Crank-Nicolson scheme

(4.4)  $u_j^{k+1} - u_j^k = \frac{\mu}{2}\,(u_{j+1}^{k+1} - 2u_j^{k+1} + u_{j-1}^{k+1} + u_{j+1}^k - 2u_j^k + u_{j-1}^k)$ ,  $j = 1,\ldots,N$

where

(4.5)  $\mu = a\,\frac{\Delta t}{(\Delta x)^2}$

and $u_j^k$ and $u_j^{k+1}$ indicate values at the current and next time levels,

respectively.  At  each  time  step,  a tridiagonal  system  of  equations  must 
be  solved  and,  as  we  saw  in  the  last  section,  this  is  not  particularly 
efficient  on  a vector  computer  with  the  algorithms  now  known.  By  contrast, 
the  simplest  explicit  method 


(4.6)  $u_j^{k+1} = u_j^k + \mu\,(u_{j+1}^k - 2u_j^k + u_{j-1}^k)$ ,  $j = 1,\ldots,N$

is mechanistically ideal for vector computers, being carried out by 5 operations
on vectors whose length is the number of interior grid points. Table 2
gives  representative  CPU  times  per  step  for  (4.6)  for  the  CDC  STAR-100  and, 
for  comparison,  for  the  CDC  CYBER  175.  These  times  show  that  the  optimal 
speed  of  50  MFLOPS  is  almost  attained  for  vectors  of  length  N = 1000  but 
only  a fraction  of  this  for  N = 50;  in  the  latter  case,  almost  three- 
fourths  of  the  time  for  each  vector  operation  is  in  start-up  while  for 
N = 1000, this penalty drops to about [fraction illegible]. Table 2 also gives
the CPU time for a Crank-Nicolson step, solving the tridiagonal systems by
scalar Gaussian elimination for N = 50 and by odd-even reduction for
N = 1000. We see that the time ratio per step of the Crank-Nicolson
and explicit methods is about 20 for N = 1000 and 14 for N = 50. On
the  other  hand,  the  stability  requirements  for  the  explicit  method  are 

$\Delta t \le \frac{2 \times 10^{-4}}{a}$ for N = 50,    $\Delta t \le \frac{10^{-6}}{2a}$ for N = 1000

which,  depending  on  the  accuracy  requirements,  may  well  cancel  out  the 
large  per  step  time  difference  in  favor  of  the  explicit  method. 
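A short Python sketch makes the trade-off concrete. It assumes the usual stability condition $\mu \le 1/2$ for (4.6), which reproduces the two bounds above when $\Delta x$ is taken as 1/N; the function names `explicit_step` and `max_stable_dt` are illustrative, and the list comprehension stands in for the vector operations.

```python
def explicit_step(u, mu):
    """One step of (4.6). u[0] and u[-1] hold the fixed boundary values."""
    interior = [u[j] + mu * (u[j + 1] - 2.0 * u[j] + u[j - 1])
                for j in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def max_stable_dt(a, dx):
    """Largest time step allowed by mu = a*dt/dx**2 <= 1/2 (assumed condition)."""
    return dx * dx / (2.0 * a)


dt_50 = max_stable_dt(1.0, 1.0 / 50)       # about 2e-4 for a = 1
dt_1000 = max_stable_dt(1.0, 1.0 / 1000)   # about 5e-7 for a = 1
```

Going from N = 50 to N = 1000 shrinks the admissible time step by a factor of 400, which is the cost to be weighed against the cheaper step.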


                                   STAR            175

Explicit (Equation (4.6))
     N = 50                       43 μsec        150 μsec
     N = 1000                    194 μsec       2500 μsec

Crank-Nicolson (4.4)*
     N = 50                      600 μsec        560 μsec
     N = 1000                   3900 μsec      11700 μsec

Dufort-Frankel (4.7)
     N = 50                       42 μsec        165 μsec
     N = 1000                    196 μsec       2800 μsec

Table 2
Time per Iteration for Various Methods


The Jacobi iteration for a tridiagonal system has a form similar to
(4.6) and could be applied as an interior iteration for the Crank-Nicolson
or another implicit method. McCulley and Zaher [1974] have reported
reasonable results with this approach for a diffusion problem on the
ILLIAC IV; in their case, 15 Jacobi sweeps sufficed at each time-step.

*The tridiagonal systems were solved completely, rather than saving the L, U
factors, at each time step since it was felt that this was more representative
of realistic problems. Time for N = 50 is using Gaussian elimination; time
for N = 1000 is using odd-even reduction.

The  issue  of  explicit  versus  implicit  methods  points  us  in  the 
direction of desiring an unconditionally stable explicit method. Perhaps
the simplest such method is that of Dufort-Frankel (see, e.g. Richtmyer
and Morton [1967]) which, for the problem (4.1)-(4.3), takes the form

$u_j^{k+1} = u_j^{k-1} + 2\mu\,(u_{j+1}^k - u_j^{k+1} - u_j^{k-1} + u_{j-1}^k)$ ,  $j = 1,\ldots,N$

or

(4.7)  $u_j^{k+1} = \frac{1-2\mu}{1+2\mu}\,u_j^{k-1} + \frac{2\mu}{1+2\mu}\,(u_{j+1}^k + u_{j-1}^k)$ .


This requires only 4 vector operations per time step although, since it
is a two-level scheme, additional storage for $u^{k-1}$ is required. Also,
this scheme is inconsistent with (4.1) and care must be taken in the proper
choice of $\Delta t/\Delta x$ to obtain a suitable approximation to the
time-dependent solution. Representative times per iteration for this method are
also given in Table 2. (The fact that the times are essentially identical to
those for the explicit method (4.6), even though one less vector arithmetic
operation is required, is due to needing two vector transmit instructions
instead of one to set the current values in the correct arrays before the
next time step. Note that one of these transmits could be avoided by
interchanging the roles of $u^k$ and $u^{k-1}$ at each time step.)
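A minimal Python sketch of one Dufort-Frankel step in the form (4.7) is given here for illustration. The function name `dufort_frankel_step` and the list storage are assumptions; the boundary values are simply carried along in the first and last positions.

```python
def dufort_frankel_step(u_prev, u_curr, mu):
    """One step of (4.7): returns the values at level k+1 from levels k-1 and k."""
    w_prev = (1.0 - 2.0 * mu) / (1.0 + 2.0 * mu)
    w_nbrs = 2.0 * mu / (1.0 + 2.0 * mu)
    interior = [w_prev * u_prev[j] + w_nbrs * (u_curr[j + 1] + u_curr[j - 1])
                for j in range(1, len(u_curr) - 1)]
    return [u_curr[0]] + interior + [u_curr[-1]]


u_prev = [0.0, 1.0, 2.0, 3.0]
u_curr = [0.0, 1.0, 2.0, 3.0]
for _ in range(3):
    # Rebinding the two names swaps the roles of the two time levels
    # without copying either array.
    u_prev, u_curr = u_curr, dufort_frankel_step(u_prev, u_curr, 1.0)
```

The name swap in the loop corresponds to the interchange of roles mentioned in the parenthetical remark, which avoids one data movement per step.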

In  any  of  the  above  methods,  the  boundary  conditions  (4.3)  are  handled 
trivially  by  a one  time  insertion  into  the  first  and  last  positions  of  the 
vector which holds the approximation at the current time level. The initial
condition is also a one time calculation and can sometimes itself be
vectorized. For example, if g is given by an explicit formula such as

(4.8)  $g(x) = c_1 x + c_2 x^2$

then the calculation of the initial vector can be done by the vector
operations

(4.9)  $c_1 * x + c_2 * x * x$

where $x = (x_1,\ldots,x_N)$ is the vector of grid points. Here, x * x
indicates the component by component product which is a hardware
instruction on all current vector machines and takes essentially the same
time as a scalar-vector multiplication.

So  far,  we  have  considered  the  most  favorable  circumstances:  constant 
coefficients,  constant  boundary  conditions,  and  one  space  dimension. 
Suppose, now, that the coefficient a in (4.1) is a function of x. Then
$\mu$ in, for example, (4.6) is a function of the grid points and the
implementation of (4.6) requires a component by component multiplication of the
vector

(4.10)  $\mu = \frac{\Delta t}{(\Delta x)^2}\,(a(x_1),\ldots,a(x_N))$

with the second difference vector of u. Thus, if storage is available,
the fact that a is a function of x adds only a one time evaluation of
the vector $\mu$ and this itself may perhaps be vectorized in the manner
indicated for the initial condition. If a were also a function of t, then
$\mu$ would have to be recomputed at each time step and the ability to do this
by vector operations would become much more important.

If the boundary conditions (4.3) were functions of t they would have
to  be  computed  at  each  time  step,  presumably  by  scalar  operations  and  this 
would  add  an  additional  inefficiency  to  the  code. 
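The variable-coefficient case of (4.6) with the vector of (4.10) can be sketched in Python as follows. The names `mu_vector` and `explicit_step_variable` are hypothetical, and each list comprehension stands in for a single vector operation.

```python
def mu_vector(a, x, dt, dx):
    """One time evaluation of (4.10) at the interior grid points x."""
    return [dt / (dx * dx) * a(xj) for xj in x]


def explicit_step_variable(u, mu):
    """One step of (4.6) with a coefficient vector; mu[j-1] belongs to point j."""
    second_diff = [u[j + 1] - 2.0 * u[j] + u[j - 1]
                   for j in range(1, len(u) - 1)]
    # Component by component product of mu with the second difference vector.
    interior = [u[j + 1] + m * s
                for j, (m, s) in enumerate(zip(mu, second_diff))]
    return [u[0]] + interior + [u[-1]]


u_new = explicit_step_variable([0.0, 1.0, 4.0, 9.0], [0.1, 0.2])
```

If a depended on t as well, `mu_vector` would have to be called inside the time loop rather than once.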

Let  us  turn  briefly  to  hyperbolic  equations  in  one  space  variable. 
Except  for  certain  "stiff"  hyperbolic  systems  (i.e.  systems  with  a wide 
range  of  eigenfrequencies  and  characteristic  phase  velocities)  implicit 
methods  are  rarely  used,  even  on  serial  computers,  and  we  will  limit  our 
attention  to  explicit  methods. 

The  vectorization  of  explicit  methods  follows  quite  closely  that  for 
parabolic  equations.  Consider,  for  example,  the  hyperbolic  system 

(4.11)  $u_t + F(u)_x = 0$ ,  0 < x < 1

with initial condition

$u(0,x) = g(x)$ ,  0 < x < 1

and suitable boundary conditions. The standard two-step Lax-Wendroff scheme
(see,  e.g.  Richtmyer  and  Morton  [1967])  is 


k+i 


ev'2_l/  K-  , ts..  ,„k  V 

+ V - ’i<VrV  • 


(4.12) 


k+1  k k+i  „ k -H  . 

- Y2(Fj4  -*}-V 


where  = At/Ax  and  = Y 2 /2. 

Assume that there are N equations and M interior grid points. The
simplest approach is to treat each component of the system (4.11) separately
in (4.12) and vectorize over the number of grid points; that is, let U1,...,UN,
UT1,...,UTN, and F1,...,FN be one-dimensional arrays of length M + 2 and
use the notation U(I;J) to denote the subarray of length J starting in
position I. Then the first part of (4.12) can be carried out by the
vector instructions

(4.13)    UT1(1;M+1) = .5*(U1(2;M+1) + U1(1;M+1)) - γ1*(F1(2;M+1) - F1(1;M+1))

and similarly for UT2,...,UTN. Next we need to evaluate F at the points
of UT1,...,UTN and in many cases this can be done primarily also by vector
instructions. For example, suppose that u is the 3-vector of density ρ,
momentum m, and energy e and $F = (m,\ p + m^2/\rho,\ (e+p)m/\rho)$ where p is given
in terms of ρ, and possibly also m and e, by some "equation of state"

          $p = f(\rho, m, e)$ .

Then the evaluation of F can be done by vector operations as indicated in

          $F_j = \left(m_j,\ p_j + \frac{m_j^2}{\rho_j},\ (e_j + p_j)\frac{m_j}{\rho_j}\right)$ for each grid index j,

where the calculation of the vector of p values may or may not also be
computed efficiently by vector operations depending on the form of f.
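
A minimal sketch of this flux evaluation follows, with each list comprehension standing in for one or two vector instructions. The ideal-gas equation of state and the value gamma = 1.4 are assumptions chosen only to make the example concrete; the text leaves f unspecified.

```python
def ideal_gas(rho, m, e, gamma=1.4):
    """Hypothetical equation of state p = f(rho, m, e) for an ideal gas."""
    return (gamma - 1.0) * (e - 0.5 * m * m / rho)


def flux(rho, m, e, eos=ideal_gas):
    """Evaluate F = (m, p + m^2/rho, (e + p) m / rho) at every grid point."""
    # Pressure vector; whether this vectorizes depends on the form of eos.
    p = [eos(r, mj, ej) for r, mj, ej in zip(rho, m, e)]
    f1 = list(m)
    f2 = [pj + mj * mj / rj for pj, mj, rj in zip(p, m, rho)]
    f3 = [(ej + pj) * mj / rj for ej, pj, mj, rj in zip(e, p, m, rho)]
    return f1, f2, f3
```

Each of `f1`, `f2`, `f3` plays the role of one of the arrays F1,...,FN in the notation of the text.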

After  F has  been  evaluated,  vector  instructions  similar  to  (4.13)  can 
be  used  for  the  second  step  of  (4.12).  In  addition,  we  will  need  to  handle 
the given boundary conditions as well as the computational boundary conditions
obtained, for example, by extrapolation (see, e.g., Gottlieb and Turkel
[1976]  for  a review  and  analysis  of  various  extrapolation  strategies). 
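
The two steps of (4.12) can be sketched for a single scalar equation as below; a system would repeat the same operations for each component. This is an illustration under stated assumptions rather than any of the cited codes: the function name is hypothetical, Python list slices stand in for the U(I;J) subvectors, and the two boundary entries are simply carried over for the caller to fix up.

```python
def lax_wendroff_step(u, flux, dt, dx):
    """One two-step Lax-Wendroff step for a scalar equation u_t + F(u)_x = 0.

    u has length M + 2 (M interior points plus two boundary points). The
    boundary entries are returned unchanged and must be set by the caller.
    """
    g2 = dt / dx
    g1 = g2 / 2.0
    f = [flux(v) for v in u]
    # First step, the analogue of (4.13): M + 1 half-point values.
    ut = [0.5 * (ur + ul) - g1 * (fr - fl)
          for ul, ur, fl, fr in zip(u[:-1], u[1:], f[:-1], f[1:])]
    # Evaluate F at the half-point values.
    ft = [flux(v) for v in ut]
    # Second step: update the M interior points from the half-point fluxes.
    interior = [uj - g2 * (fr - fl)
                for uj, fl, fr in zip(u[1:-1], ft[:-1], ft[1:])]
    return [u[0]] + interior + [u[-1]]
```

For the linear flux F(u) = u with dt = dx, the scheme reduces to an exact shift of the data by one grid point, which makes a convenient sanity check.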

The  above  approach  could  be  made  more  efficient  if  vectors  of  length  MN 
could  be  used  rather  than  just  of  length  M.  In  some  cases,  this  will  be 
possible  using  some  techniques  that  we  discuss  next  for  elliptic  equations. 


Elliptic Equations In Two Dimensions

The implementation of many of the usual iterative methods for
discrete elliptic equations has been studied rather extensively by
Ericksen [1972], Hayes [1974], and Lambiotte [1975], primarily for the ILLIAC
IV, the ASC, and the STAR-100, respectively, and Morice [1972] for general
parallel  processors.  We  will  review  in  this  section  many  of  the  issues 
they  have  discussed  and,  for  simplicity,  refer  to  their  work  collectively 
by  EHLM  unless  an  individual  reference  is  more  appropriate.  We  also  note 
here  the  paper  by  Heller  [1977],  which  surveys  many  aspects  of  iterative 
methods  for  parallel  computers. 

The detailed considerations of EHLM were primarily restricted to the
model problem of Poisson's equation on a square with Dirichlet boundary
data and using the five point difference star, and this will be our
starting point also. Such a problem would, of course, actually be solved by
one of the fast Poisson solvers mentioned in the last section but it
makes a convenient example with which to treat many of the issues of
vectorization that arise in more general problems.

The  discrete  domain  is  indicated  in  Figure  5 where  the  boundary  points 
are  indicated  by  bold  dots  and  N is  the  number  of  interior  points  in 
each  row  and  column. 

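[Figure 5: the (N+2) x (N+2) grid of boundary and interior points, numbered
row by row from the lower left: 1, 2, ... along the bottom row, N+3, N+4, ...
along the second row, 2N+5, ... along the third.]

Figure 5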
Consider the Fortran loop

(4.14)          DO 1 J = 2, N + 1
                DO 1 I = 2, N + 1
              1 UN(I,J) = .25*(U(I+1,J)+U(I-1,J)+U(I,J+1)+U(I,J-1))

where the current values of the unknowns as well as the boundary values
are stored in an (N+2) x (N+2) array. Equation (4.14) defines the arithmetic
step of the Jacobi iteration, a process which (for general linear systems
as well as the discrete Laplacian represented by (4.14)) has been
extensively cited as a prototype parallel method. However, care is needed in
certain aspects of its implementation in order to achieve the fullest possible
vectorization. For example, (4.14) would be more efficiently implemented on
the ILLIAC in a row by row fashion if N were 64 or a multiple thereof and
an inefficiency of up to a factor of 2 could result for other values. On
the STAR and ASC, on the other hand, we would like the vector lengths to
be as long as possible.

For any reasonable value of N, (4.14) translates directly into 4
hardware instructions on the ASC which operate on vectors of effective
length $O(N^2)$. On the STAR, this is not possible since a vector consists
only of contiguous memory locations. One could carry out (4.14) row by
row but then the vectors would only be of length N and one would pay
N times the number of start-up penalties. Alternatively, we can use
vectors of length $O(N^2)$ by treating the boundary positions as unknowns.
That is, let U now denote an $(N+2)^2$ long one-dimensional array with
the lexicographic ordering of Figure 5, and again use the notation U(K;L)
to denote the L-long subvector starting at the Kth position of U. With
M1 = (N+1)(N+2) - 1 and M2 = N(N+2) - 2, we can then implement (4.14)
by the instructions

(4.15)    T(2;M1) = U(2;M1) ± U(N+3;M1)
          U(N+4;M2) = T(2;M2) ± T(N+5;M2)


where T is a temporary vector and where we have used ± to denote an
"average" instruction, that is, addition followed by division by 2. Such
an instruction is available on the STAR and takes essentially the same
time as an addition.
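
To see why two average instructions reproduce the four-point Jacobi formula, the following sketch runs both forms on the same data. The first average pairs each point with its diagonal neighbour one row up and one column to the left; the second then combines two such pairs so that exactly the left, right, lower and upper neighbours of each point are averaged. Python lists are 0-based, so the helper `sub` (a hypothetical name) translates the 1-based U(K;L) notation.

```python
def jacobi_loop(u, n):
    """Reference: the double DO loop (4.14) on an (n+2) x (n+2) array u[j][i]."""
    un = [row[:] for row in u]
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            un[j][i] = 0.25 * (u[j][i + 1] + u[j][i - 1]
                               + u[j + 1][i] + u[j - 1][i])
    return un


def sub(v, k, length):
    """U(K;L): the L-long subvector starting at 1-based position K."""
    return v[k - 1:k - 1 + length]


def average(x, y):
    """The 'average' vector instruction: addition followed by division by 2."""
    return [0.5 * (a + b) for a, b in zip(x, y)]


def jacobi_long_vectors(u, n):
    """The two instructions of (4.15) on the (n+2)**2 long vector u."""
    m1 = (n + 1) * (n + 2) - 1
    m2 = n * (n + 2) - 2
    t = [0.0] * len(u)
    t[1:1 + m1] = average(sub(u, 2, m1), sub(u, n + 3, m1))
    new = u[:]
    # This store also lands on boundary positions along the vertical sides.
    new[n + 3:n + 3 + m2] = average(sub(t, 2, m2), sub(t, n + 5, m2))
    return new
```

The two routines agree at every interior point; they differ only at the vertical-side boundary positions that the long-vector store overwrites.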

As a penalty for using vectors whose length is the total number of grid
points, the final instruction of (4.15) will write onto the positions

          2N+4, 2N+5, 3N+6, 3N+7, ...

corresponding to most of the boundary positions along the vertical sides,
thus destroying the correct boundary values in those positions. One would
then have to restore these values before the next iteration. On the STAR,
however, there is a convenient feature which permits storage to be controlled
by a bit vector (the control vector); this can be used to ensure that the
boundary positions are not overwritten and, hence, no "fixing up" is needed
before the next iteration. Let B be a bit vector with 0's in those
positions corresponding to the boundary points in the vector U(N+4;M2)
and 1's in the interior grid point positions. The last instruction of
(4.15) would then be replaced by

(4.16)    U(N+4;M2) · Control B = T(2;M2) ± T(N+5;M2)

indicating that storage is suppressed into those positions of U(N+4;M2)
corresponding to 0's in B, i.e., corresponding to the boundary positions.
The instruction time is no greater using the control vector but, of course,
one pays the penalty of approximately $N^2/64$ words of storage for
the bit vector.
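
The controlled store can be imitated in software as follows. This is only a sketch of the idea: on the STAR the control vector is applied by the hardware during the store, whereas here an explicit loop tests each bit. The function names are hypothetical, and the column test assumes the row by row ordering of Figure 5.

```python
def control_vector(n):
    """Bit vector B for U(N+4;M2): 1 = interior point, 0 = boundary point."""
    m2 = n * (n + 2) - 2
    width = n + 2
    # Position N+4+s (1-based) sits in 0-based column (n + 3 + s) % width.
    return [1 if 1 <= (n + 3 + s) % width <= n else 0 for s in range(m2)]


def jacobi_with_control(u, n):
    """(4.15) with its last instruction replaced by the controlled store (4.16)."""
    m1 = (n + 1) * (n + 2) - 1
    m2 = n * (n + 2) - 2
    t = [0.0] * len(u)
    # T(2;M1) = average of U(2;M1) and U(N+3;M1), in 0-based slices.
    t[1:1 + m1] = [0.5 * (a + b)
                   for a, b in zip(u[1:1 + m1], u[n + 2:n + 2 + m1])]
    averaged = [0.5 * (a + b)
                for a, b in zip(t[1:1 + m2], t[n + 4:n + 4 + m2])]
    new = u[:]
    for s, (value, bit) in enumerate(zip(averaged, control_vector(n))):
        if bit:  # storage is suppressed where the control bit is 0
            new[n + 3 + s] = value
    return new
```

For N = 3 the zero bits fall on 1-based positions 10, 11, 15 and 16, that is, on 2N+4, 2N+5, 3N+6 and 3N+7, matching the positions listed in the text.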
The temporary vector in (4.15) can be eliminated by storing the result
of the first instruction back onto U itself. However, even using a
control vector, this will necessarily destroy about half of the boundary
values which will then have to be replaced.

Whereas the Jacobi iteration is often cited as a "perfect" parallel
algorithm, the Gauss-Seidel and SOR iterations are usually considered to
be the opposite. The usual serial code for Gauss-Seidel in the context
of the DO loop (4.14) would have UN replaced by U itself so that new
values at each point replace the old as soon as they are computed; it is
this process that is not amenable to vectorization. However, EHLM have
shown that by using the classical red-black ordering of the grid points,
as shown in Figure 6, Gauss-Seidel can be carried out in essentially the
same fashion as the Jacobi iteration but using vectors of length $O(N^2/2)$
corresponding to the red points and the black points. The boundary points
would be handled in the same way as with Jacobi's method. The time per
iteration for SOR carried out in this way should be little more than for
the Jacobi iteration so that the SOR method is potentially very useful for
vector machines. Lambiotte [1975] also considered a diagonal ordering for
the grid points but showed that this is inferior to the red-black ordering.

Figure 6.  The Red-Black Ordering
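
The key property of the red-black ordering is that every point of one colour has neighbours of the other colour only, so all new values of a colour can be computed before any of them is stored. The sketch below makes that explicit for Laplace's equation. The colouring rule (i + j even is "red") and the function names are assumptions for illustration; with omega = 1 the sweep is Gauss-Seidel.

```python
def colour_pass(u, n, colour, omega):
    """Update all interior points of one colour; u is modified in place."""
    points = [(i, j) for j in range(1, n + 1) for i in range(1, n + 1)
              if (i + j) % 2 == colour]
    # Every new value depends only on points of the other colour, so the
    # whole list can be computed (as one long vector) before storing.
    new = [u[j][i] + omega * (0.25 * (u[j][i + 1] + u[j][i - 1]
                                      + u[j + 1][i] + u[j - 1][i]) - u[j][i])
           for i, j in points]
    for (i, j), value in zip(points, new):
        u[j][i] = value


def red_black_sweep(u, n, omega=1.0):
    """One SOR sweep in red-black order (Gauss-Seidel when omega = 1)."""
    colour_pass(u, n, 0, omega)   # "red" points: i + j even
    colour_pass(u, n, 1, omega)   # "black" points: i + j odd
    return u
```

With boundary values fixed at 1 and a zero initial guess, repeated sweeps drive every interior value toward 1, as expected for Laplace's equation.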

Similarly to SOR, the Alternating Direction Implicit (ADI) method seems,
at first glance, to be rather unsatisfactory for vector computers since it
is based on the solution of tridiagonal equations. Ericksen [1972] and
Morice [1972], however, observed that since these tridiagonal systems are
independent, they can be solved in parallel. More precisely, we recall that
the ADI method - for the model problem and the grid of Figure 5 - consists
of two half-steps as indicated by the iteration scheme (see, e.g., Varga [1962])

(4.17a)    $(H + \alpha_k I)\,x^{k+1/2} = (\alpha_k I - V)\,x^k + b$

(4.17b)    $(V + \alpha_k I)\,x^{k+1} = (\alpha_k I - H)\,x^{k+1/2} + b$

The first step, (4.17a), consists of the solution of N tridiagonal systems
corresponding to the horizontal lines of the grid, while (4.17b) likewise is
the solution of N tridiagonal systems (after permutations of the unknowns)
corresponding to the vertical lines. On the ILLIAC, up to 64 of these
tridiagonal systems can be solved in parallel using the usual Gaussian
elimination algorithm. Moreover, the storage mechanism on the ILLIAC is such that
the alternating sweeps can be handled with little difficulty.

On the STAR or ASC, the vectorization would be across the tridiagonal
systems; that is, suppose, more generally, that we have N n x n linear
systems

          $A^{(k)} x^{(k)} = b^{(k)}, \quad k = 1, \ldots, N$

with $A^{(k)} = (a_{ij}^{(k)})$ and $b^{(k)} = (b_i^{(k)})$. We then set up $n^2 + n$
vectors

(4.18)    $A_{ij} = (a_{ij}^{(1)}, \ldots, a_{ij}^{(N)}), \quad B_i = (b_i^{(1)}, \ldots, b_i^{(N)}), \quad i, j = 1, \ldots, n$

and solve the N systems by carrying out Gaussian elimination in the
usual fashion, but now on the vectors of (4.18).
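
For tridiagonal systems only three of the coefficient vectors per row are nonzero, and the elimination can be sketched as below. Each list comprehension is one operation on vectors of length N, the number of systems. This is a minimal illustration without pivoting; the argument names are hypothetical, and the layout (row index first, system index second) is the one implied by (4.18).

```python
def solve_tridiagonal_systems(lower, diag, upper, rhs):
    """Solve N independent n x n tridiagonal systems simultaneously.

    lower[i], diag[i], upper[i], rhs[i] are vectors of length N holding row i
    of every system (lower[0] and upper[n-1] are unused). No pivoting.
    """
    n = len(diag)
    d = [row[:] for row in diag]
    b = [row[:] for row in rhs]
    # Forward elimination: each statement is one vector operation of length N.
    for i in range(1, n):
        w = [l / p for l, p in zip(lower[i], d[i - 1])]
        d[i] = [di - wi * ui for di, wi, ui in zip(d[i], w, upper[i - 1])]
        b[i] = [bi - wi * bp for bi, wi, bp in zip(b[i], w, b[i - 1])]
    # Back substitution, again vectorized across the N systems.
    x = [None] * n
    x[n - 1] = [bi / di for bi, di in zip(b[n - 1], d[n - 1])]
    for i in range(n - 2, -1, -1):
        x[i] = [(bi - ui * xn) / di
                for bi, ui, xn, di in zip(b[i], upper[i], x[i + 1], d[i])]
    return x
```

The result `x[i][k]` is unknown i of system k, so the solution is returned in the same "across the systems" layout as the input.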

In order to implement this for the tridiagonal systems in the ADI method,
it is primarily a question of arranging the storage layout correctly. On
the ASC this can be done with little difficulty, and with essentially the
usual serial Fortran code, because of that machine's ability to treat
noncontiguous elements as vectors. On the STAR, however, the storage must be
rearranged - by the equivalent of a matrix transpose - between each sweep.
Lambiotte [1975] suggests the following alternative: On the half-sweep on which
the storage is not correct for the simultaneous solution of the tridiagonal
systems, it is correct for the solution of the individual tridiagonal systems
by the cyclic reduction (CR) method of the previous section. Moreover, the
N individual systems may be viewed as forming a single tridiagonal system
N times as large and the CR method may be applied to this large system;
as we saw in the last section, the larger the system the better. Finally,
because the individual systems are uncoupled, the CR method will actually
terminate in $\log_2 N$ steps rather than the expected $\log_2 N^2$. Thus, the ADI
algorithm is implemented by solving the tridiagonal systems "in parallel"
on one half-sweep and as a single large tridiagonal system on the other
half-sweep. Lambiotte also discusses a similar strategy for three dimensional
problems.
In a limited number of recent experiments on the STAR, however,
Lambiotte [1977] has shown that it actually seems to make little difference
in running time whether the above ADI algorithm is implemented or the
tridiagonal systems solved in parallel on each half-sweep with a matrix
transpose in between. This is presumably due to the fact that there is a
hardware 8x8 matrix transpose instruction on the STAR which forms the basis for
an efficient transpose procedure.

Many of the same considerations as for ADI apply also to the
implementation of Successive Line Over-Relaxation (SLOR). If one uses the lexicographic
ordering, the same difficulties occur as with point SOR. To circumvent this,
Ericksen [1972] and Lambiotte [1975] study various other orderings, such as
the red-black ordering by lines, which allow a number of the tridiagonal
systems either to be solved in parallel or as one large tridiagonal system
by cyclic reduction.

Another group of methods which may be potentially useful on vector
machines is the semi-iterative (SI) methods. (See, e.g., Young [1971] for a
general discussion of these methods.) Consider, for example, the Jacobi-SI
method which can be written in the form

(4.19)    $u^{k+1} = \alpha_k B u^k + \beta_k u^k + \gamma_k u^{k-1}$

for suitable choice of the parameters $\alpha_k$, $\beta_k$, and $\gamma_k$. Here B is the Jacobi
iteration matrix so that $Bu^k$ is the result of a Jacobi sweep starting
from $u^k$, and the remainder of the calculation of (4.19), once the parameters
are known, is ideally suited for vector machines. Of course, one pays the
penalty of additional storage for $u^{k-1}$. More importantly, the choice of
good parameters may be difficult for other than the model problem: the
optimal parameters are based on a knowledge of the largest and smallest eigenvalues
(assumed real) of B. In the case that the coefficient matrix A is
symmetric positive definite and has Property A, then it is known that the
asymptotic rate of convergence of Jacobi-SI is approximately half that of
SOR, both using optimal parameters; in this case, Jacobi-SI may not be
useful, even on vector computers. However, in more general situations, the
rate of convergence of Jacobi-SI may be quite superior to SOR and its
somewhat better vectorization properties make it potentially attractive,
provided that reasonable values of the parameters can be chosen. Hayes [1974]
and Lambiotte [1975] have considered, for the ASC and STAR, respectively,
the Jacobi-SI method in some detail, as well as other semi-iterative methods
such as SSOR-SI and cyclic Chebyshev-SI.

So far we have considered explicitly only the model problem of Poisson's
equation on a square, although, of course, parts of the previous discussion
apply more generally. With a more general elliptic operator - but still a
rectangular region - the difficulties will tend to revolve around the best
ways to compute and manage storage of the coefficients. For example,
consider the equation

(4.20)    $a u_{xx} + b u_{yy} + c u_{zz} = f, \quad 0 < x, y, z < 1$

where a, b, and c are functions of x, y, and z. The corresponding difference
equations using the usual 7-pt formula with $h = \Delta x = \Delta y = \Delta z$ are

          $2(a_{ijk} + b_{ijk} + c_{ijk})u_{ijk} - a_{ijk}(u_{i+1,j,k} + u_{i-1,j,k})$
          $\quad - b_{ijk}(u_{i,j+1,k} + u_{i,j-1,k}) - c_{ijk}(u_{i,j,k+1} + u_{i,j,k-1}) = -h^2 f_{ijk}$

For a sufficiently coarse grid, the coefficients can all be computed once
and for all and held in five $O(N^3)$ long arrays. But for a moderately fine
grid, say N = 50-100, out-of-core storage would be required and depending
upon the complexity of the coefficients, the operating system, and various
other factors, it may be more economical to recompute the coefficients at
each iteration. This strategy is, of course, common on existing
serial machines and the only new factor for vector computers would be to
compute the coefficients in sufficiently large batches - say 1,000 - 10,000
at a time - so that the computation as well as the subsequent usage in the
iterative scheme can be done efficiently with vector operations.

A less  satisfactory  situation  exists  for  handling  irregular  domains. 
Consider,  for  example,  the  grid  in  Figure  7 

[Figure 7: an irregular grid circumscribed by a rectangle]

Figure 7

where the boundary nodes are indicated by bold dots. One way to handle
such a grid is to circumscribe it by a rectangle - the additional grid points
thus introduced are indicated in Figure 7 by crosses - and work with the
entire rectangular grid. For example, for Laplace's equation and the Jacobi
iteration on such a grid, one could use the DO loop (4.14) on the ASC or the
vector code (4.15) on the STAR. For the ASC, the values at the boundary nodes
in the interior of the rectangle will be destroyed and must be replaced prior
to the next sweep. On the STAR, a control vector can be used, as before, to
ensure that the boundary positions are not overwritten. Of course, both
additional storage and additional arithmetic are required for the
rectangular points outside of the domain and the procedure becomes
increasingly less efficient as the domain deviates from a rectangle. At some
point, it is probably beneficial to use a union of smaller circumscribing
rectangles, as indicated in Figure 8 for the case of two such rectangles.

[Figure 8: the irregular grid covered by two circumscribing rectangles,
separated by a dotted line]

Figure 8.

This, of course, would save considerable
storage over a complete circumscribing rectangle but now the two rectangles
must be processed separately. That is, the DO loop (4.14) or the code (4.15)
must be written separately for the two rectangles to take account of the
different row lengths as the dotted line is crossed. Ideally, of course,
one would like an ordering of the grid points that would allow processing
and storage of only the minimum number of points and still use vectors whose
length is the total number of grid points; but such an ordering, if it exists,
is not evident and has not appeared in the literature.
One might argue that the above discussion on iterative methods for
elliptic equations using finite difference discretizations is somewhat
irrelevant since the trend of the last several years has been, increasingly,
to use projection methods (i.e. finite elements, Galerkin, etc.) to
discretize elliptic equations and then direct methods for the solution of the
algebraic systems. We note several points about this. First, finite
differences have been used here primarily to simplify the discussion but many of
the same considerations would apply to the discussion of iterative methods
for finite element or other projection discretizations. Secondly, iterative
methods seem to vectorize somewhat better, on the whole, than direct methods
and the balance of favor may well shift towards iterative methods for vector
computers, especially for three dimensional problems. Thirdly, as mentioned
previously, many of the considerations for handling iterative methods go
over directly to time-marching methods for initial-boundary value problems.
Finally, iterative methods play a prominent role in the "multigrid" method,
one of the currently most promising approaches to the solution of elliptic
equations. (See Brandt [1977] and Nicolaides [1975] for discussion of this
method.)
5.  Summary and Prognosis


The  use  of  large  scale  vector  computers  in  scientific  computation  is 
barely  four  years  old  and  the  serious  study  of  the  best  ways  to  use  such 
computers  has  been  largely  restricted  to  the  half  dozen  or  so  government 
laboratories  where  such  computers  exist  or  where  active  preparation  for 
one  is  under  way.  The  university  community  of  numerical  analysts  and 
others  in  scientific  computing  has  been  rather  little  involved  to  date  in 
developing  or  analyzing  numerical  methods  for  such  computers.  Moreover, 
the  existing  machines  are  sufficiently  different  in  detail  that  it  is 
not  always  the  case  that  an  algorithm  which  is  best  for  one  machine  will 
even  be  satisfactory  for  another. 

Nevertheless, progress has been made. The implementation of the usual
direct methods for solving linear systems of equations is now well
understood although not altogether satisfactory: Gaussian elimination or
similar factorization methods become relatively less efficient as the
bandwidth decreases; in the important limit of tridiagonal matrices, cyclic
reduction appears to be the best method although for matrices of moderate
size (n = 100-500) it is relatively little, if at all, better than
Gaussian elimination in scalar code.

Most of the usual iterative methods have now been studied to some extent
and, in general, seem somewhat more competitive on vector computers (vis-a-vis
direct methods) than on serial ones. But most analysis to date has been
on simple model problems and many questions remain for more realistic
situations: irregular domains, systems of equations, variable coefficients,
etc. And some of the most potentially promising iterations, such as the
conjugate gradient method or the multigrid method, have not yet been
seriously studied.
For time-dependent initial-boundary value problems the implementation
of explicit methods has much in common with that of iterative methods for
elliptic equations. But little analysis or comparative study has yet been
carried out although various methods are being used. The issue of explicit
versus implicit methods for parabolic equations is still unclear although
explicit methods seem relatively more advantageous on vector computers.
The method of lines - which is becoming increasingly popular as a basis for
"automatic" programs - has not yet been seriously studied.

All of the existing vector computers have both strong and weak points.
Ideally, we would like machines that allow maximum flexibility in the
machine definition of a vector - at a minimum, equally spaced storage words
should be allowed; minimum overhead (start-up time) for vector hardware
operations; no penalty, relative to the fastest available technology, for
the use of scalar arithmetic; sufficient amounts of fast memory to balance
the high CPU processing rates, and so on. On the software side, we
especially need more sophisticated programming languages and compilers.

None of the existing vector computers have all of these desirable
features but, then, we are only in the first generation of such machines.
Moreover, we seem to be witnessing a genuine revolution in hardware and
increasingly serious speculation is being made about array processors of the future
which will utilize thousands of independent microprocessors. (For a
prognosis of computer power in the 1980's, see Stone [1976].) The knowledge we
are slowly gaining today about the efficient use and the drawbacks of the
current vector computers will surely be useful for the next generation of
such computers.
A MATHEMATICAL SOLUTION TO THE BALLISTIC DIFFERENTIAL EQUATION
PROGRAMMED FOR THE TEXAS INSTRUMENT SR-52 CALCULATOR
I. ABSTRACT. The fundamental differential equation of motion for a
projectile is

(1.1)    $\dot{\vec{V}} + H\vec{V} - \vec{g} = 0$

where $\vec{V}$ is the projectile velocity, $\vec{g}$ is the acceleration due to gravity
and H is a function of the dependent variable $\vec{V}$. If it were not for
the fact that H is a complicated function of the magnitude of projectile
velocity, then equation (1.1) would have a closed form solution. Specifically,
H is defined by equation

(1.2)    $H = \frac{\rho d^2 v K_D}{w}$

where

d = diameter of projectile (meters),
w = weight of projectile (kilograms),
v = speed of projectile (meters/second),
$K_D$ = drag function and
$\rho = \rho_0 e^{-hz}$

with $\rho_0$ = standard air density (kilograms/meters³), z = altitude of
projectile (meters) and h = 0.000045 $\log_e 10$ per meter. Observe that the
unit of measure for H is a real number per second.
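
As a hedged numerical illustration of (1.2), the sketch below evaluates the air density and H for given projectile data. Two readings are assumptions rather than facts taken from the text: the constant h is taken as 0.000045 ln 10 per meter (equivalent to a density factor of 10 to the power -0.000045 z), and the default sea-level density of 1.225 kg/m³ is a common standard-atmosphere value that may differ from the one the author used. The function names are hypothetical.

```python
import math

# Assumed reading of the decay constant h, per meter.
H_DECAY = 0.000045 * math.log(10.0)


def air_density(z, rho0=1.225):
    """rho = rho0 * exp(-h z); rho0 in kg/m^3 is an assumed standard value."""
    return rho0 * math.exp(-H_DECAY * z)


def drag_rate(d, w, v, k_d, z, rho0=1.225):
    """H of (1.2), in 1/second, for diameter d (m), weight w (kg), speed v (m/s),
    drag function value k_d (dimensionless) and altitude z (m)."""
    return air_density(z, rho0) * d * d * v * k_d / w
```

The units work out as (kg/m³)(m²)(m/s)/kg = 1/s, in agreement with the remark that H is a real number per second.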

In this paper the fundamental ballistic equation is solved by an
iterative method which generates a five degree of freedom program for the
SR-52 with the following outputs: (1) time of flight (seconds), (2) range
(meters), (3) altitude (meters), (4) remaining projectile speed (meters/sec),
(5) range rate (meters/sec), (6) altitude rate (meters/sec) and (7) projectile
angle of rise-fall (radians).

II. INTRODUCTION. The fact that there is no closed form solution to the
non-linear ballistic equation indirectly led to the development of the ENIAC
in 1946 at the University of Pennsylvania. This very large vacuum tube
machine was man's first all-electronic digital computer and one of its tasks
was to produce projectile ballistic tables. This paper gives a five degree
of freedom ballistic program for the hand-held SR-52 programmable calculator.
You will find that the tables generated on the SR-52 are reasonably accurate.

A numerical integration method (some quadrature formula) is most
frequently used to solve equation (1.1). This technique requires a
reasonably large computer and it is long. The solution method described in
this paper is so simple that others must have discovered it before I did.
In any case, I first discovered the iterative solution in 1963 while working
as a consultant for the Fire Control Directorate at Frankford Arsenal. At
this time, the Directorate was performing an error analysis on an AA weapon.
While waiting for BRL ballistic tables to be generated we decided to
generate our own to validate the error analysis model. However, the AA
weapon error analysis was to be performed on BRL ballistic data. It was
a pleasant surprise to find that the tables generated by our iterative
method agreed with the BRL tables to within 5 meters in slant range for
gun elevation angles between 20° and 80° and for slant ranges out to 2500
meters (maximum slant range for the error analysis). In the sequel you
will observe that the accuracy of data depends on the iteration increment
(Δt) and maximum slant range computed.

III.  Solution  of  the  Fundamental  Ballistic  Equation 


In the cartesian X-Y plane, the fundamental ballistic equation becomes

(3.1)    $\frac{d^2x}{dt^2} + H\frac{dx}{dt} = 0$

(3.2)    $\frac{d^2y}{dt^2} + H\frac{dy}{dt} + g = 0$ .

As mentioned in the introduction, a solution to the above ballistic equations
in closed form is impossible for the form of H given in equation (1.2).
The parameters d, w and $K_D$ of H are determined by the projectile. Frequently,
$K_D$ is given as a function of Mach number as shown in Figure A.1. However,
$K_D$ may also be given as a function of remaining projectile speed. In either
case a fourth degree polynomial fit is adequate.

[Figure A.1: the drag function $K_D$ plotted against Mach number]

Figure A.1
Recall that ρ (air density) varies with altitude, air temperature and
percentage of water vapor in the air, but ρ is mainly a function of altitude
(z) for many ballistic projectiles. The program for the SR-52 uses the
standard air density formula $\rho = \rho_0 e^{-hz}$ which is adequate for the
range of altitudes usually considered. Observe that here ρ is a function
of altitude only.

Equations (3.1) and (3.2) are solved under the assumption that H does
not change during the time interval Δt, which is a reasonable assumption
when 0 < Δt < 0.01. Under this assumption the ballistic equations are
linear differential equations of order two with constant coefficients which
have the following closed form solutions:

(3.3)    $x = c_1 - c_1 e^{-Ht}$

(3.4)    $y = k_1 - k_1 e^{-Ht} - \frac{g}{H}\,t$

for $0 \le t < \Delta t$. The time derivative of each equation is

(3.5)    $\frac{dx}{dt} = c_1 H e^{-Ht}$

(3.6)    $\frac{dy}{dt} = k_1 H e^{-Ht} - \frac{g}{H}$

and

(3.7)    $v = \sqrt{\left(\frac{dx}{dt}\right)^2 + \left(\frac{dy}{dt}\right)^2}$

for $0 \le t < \Delta t$. Let $\phi(t)$ be defined by the equation

(3.8)    $\tan\phi(t) = \frac{dy}{dt} \Big/ \frac{dx}{dt}$

for $0 \le t < \Delta t$. Then the arbitrary constants $c_1$ and $k_1$ are
described by

(3.9)    $c_1 = \frac{v\cos\phi}{H e^{-Ht}}$

(3.10)   $k_1 = \frac{v\sin\phi + g/H}{H e^{-Ht}}$

for $0 \le t < \Delta t$. The equations (3.3) through (3.10) are used to
generate a ballistic table (see table attached) for a projectile for which
d, w, $K_D$, $v_0$ (muzzle velocity) and $\phi_0$ (gun firing angle) are known.
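
The closed form solution over a single interval can be checked numerically with a short sketch. It evaluates (3.9) and (3.10) at t = 0 and then (3.3) through (3.6) at a later time, with H held constant. The value g = 9.80665 m/s² is the standard gravity constant rather than a figure from the paper, and the function names are hypothetical.

```python
import math

G = 9.80665  # standard gravity in m/s^2 (the paper does not state its value)


def constants(v, phi, h):
    """c1 and k1 from (3.9) and (3.10), evaluated at t = 0."""
    c1 = v * math.cos(phi) / h
    k1 = (v * math.sin(phi) + G / h) / h
    return c1, k1


def state(c1, k1, h, t):
    """Position (3.3), (3.4) and velocity (3.5), (3.6) at time t, H constant."""
    decay = math.exp(-h * t)
    x = c1 - c1 * decay
    y = k1 - k1 * decay - G * t / h
    vx = c1 * h * decay
    vy = k1 * h * decay - G / h
    return x, y, vx, vy
```

At t = 0 the position is the origin and the velocity components are v cos φ and v sin φ, and the vertical velocity satisfies d(vy)/dt + H vy + g = 0, which is (3.2) written for dy/dt.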
The incremental technique programmed (see Ballistic Program table) is the
following. Let the muzzle of the gun be placed at the origin of the X-Y
coordinate system and assume that the projectile is fired with a muzzle
velocity $v_0$ at an angle $\phi_0$ measured from the X-axis. The value of $K_D$ at
time t (initially t = 0), which is constant during the interval $0 \le t < \Delta t$,
is determined from the polynomial fit for $K_D$ simply by substituting for v
(initially $v = v_0$). Once $K_D$ is determined, H is determined using
the parameters of the projectile, the present value of v and the value
of $\rho$ evaluated for the present altitude of the projectile. Equations
(3.9) and (3.10) determine the arbitrary constants $c_1$ and $k_1$ in the interval
$0 \le t < \Delta t$ for the present value of v and the value of H
computed above. The position and velocity of the projectile at time
$t = n \Delta t$ (initially n = 1) are derived from equations (3.3), (3.4),
(3.5) and (3.6). For n = 2 the origin of the coordinate system is
translated to the present position of the projectile just computed.

Observe that equations (3.3) through (3.10) are solutions for the next
time interval $\Delta t \le t < 2\Delta t$ but with initial conditions determined
by the equations:

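1)  at $t = 1\,\Delta t$:  $x = y = 0$

2)  $v = \sqrt{\left(\frac{dx}{dt}\Big|_{t=\Delta t}\right)^2 + \left(\frac{dy}{dt}\Big|_{t=\Delta t}\right)^2}$

3)  $\phi(\Delta t) = \tan^{-1}\left(\frac{dy}{dt}\Big|_{t=\Delta t} \Big/ \frac{dx}{dt}\Big|_{t=\Delta t}\right)$.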


Now the process of the previous paragraph is repeated. After each
iteration the projectile's position is found by summing the increments
$x_i$, $y_i$. That is:

range $= \sum x_i$, altitude $= \sum y_i$, and slant range $= \sqrt{(\sum x_i)^2 + (\sum y_i)^2}$.
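The incremental procedure can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the SR-52 program: the function names, the value of g, and the `drag_rate` callback (a stand-in for evaluating H from $K_D$, $\rho$ and the projectile parameters) are all hypothetical. Because t is measured from the start of each interval, the constants (3.9) and (3.10) are evaluated at local t = 0.

```python
import math

G = 9.80665  # assumed gravitational acceleration, m/s^2


def step(vx, vy, H, dt, g=G):
    """Advance one interval of length dt with H held constant.

    Returns the position increments (dx, dy) and the velocity at the end
    of the interval, using the closed forms (3.3)-(3.6).
    """
    decay = math.exp(-H * dt)
    c1 = vx / H              # (3.9) at local t = 0: v cos(phi) / H
    k1 = (vy + g / H) / H    # (3.10) at local t = 0: (v sin(phi) + g/H) / H
    dx = c1 - c1 * decay                 # (3.3)
    dy = k1 - k1 * decay - g * dt / H    # (3.4)
    vx_new = c1 * H * decay              # (3.5)
    vy_new = k1 * H * decay - g / H      # (3.6)
    return dx, dy, vx_new, vy_new


def trajectory(v0, phi0, drag_rate, dt=0.01, n_steps=100):
    """Sum the per-interval increments; drag_rate(v, altitude) returns H > 0."""
    vx, vy = v0 * math.cos(phi0), v0 * math.sin(phi0)
    x = y = 0.0
    for _ in range(n_steps):
        v = math.hypot(vx, vy)       # (3.7): present speed
        H = drag_rate(v, y)          # frozen for this interval
        dx, dy, vx, vy = step(vx, vy, H, dt)
        x += dx                      # range
        y += dy                      # altitude
    return x, y, math.hypot(x, y)    # range, altitude, slant range
```

When `drag_rate` returns the same H on every call, the summed increments reproduce the closed-form solution over the whole flight, which gives a convenient check of the stepping logic before a speed- and altitude-dependent H is plugged in.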


IV. CONCLUSION. Although the ballistic solution described is almost
trivial, the accuracy of the generated tabular data (see example of table
on next page) is surprisingly good. The accuracy of the computed data can
be shown to be a function of $\Delta t$ for the maximum D computed. For
$0 \le \Delta t \le 0.01$ you can expect the error in the computed slant range
data to be less than 10 meters for D out to 5000 meters. However, the most
remarkable fact is that this method can be programmed on a hand-held SR-52
calculator.
APPROXIMATE DECOMPOSITIONS OF SPARSE MATRICES


Henk A. van der Vorst
Academic  Computer  Center  Utrecht 
Budapestlaan  6 
De  Uithof  - Utrecht 
Netherlands 


ABSTRACT.  The  discretisation  of  self-adjoint  elliptic  partial 
differential  equations  often  leads  to  large  linear  systems  with  a 
sparse  matrix.  An  attractive  way  of  solving  such  systems  arises  when 
an  approximate  decomposition  of  the  matrix  is  used  in  the  application 
of  the  conjugate  gradients  method.  Some  of  these  approximate 
decompositions  will  be  treated  in  more  detail. 

1. INTRODUCTION. We consider large linear systems

(1)    Ax = b

where A is a sparse symmetric matrix. More explicitly, the matrix A is
assumed to arise from 5-point finite difference discretisation of
second order self-adjoint partial differential equations. However, most
of the ideas can be extended to other types of sparse matrices.

Both direct and iterative methods that take advantage of the special
sparse structure of A have been studied for the solution of (1).

For direct methods we mention papers of George [1,2] and
Buzbee et al. [3,4]. Iterative methods have been studied by Varga [5],
Young [6], Concus & Golub [7,8], O'Leary [9], Axelsson [10,11] and
many others.

In Bunch & Rose [12] and in Barker [13] many recent techniques and
results for direct methods as well as iterative methods are covered
and further references may be found there.

The  results  presented  in  this  paper  are  based  on  ideas  outlined  by 
Meijerink  & van  der  Vorst  [ 14].  They  propose  iterative  methods  that 
exploit  certain  approximate  decompositions  of  the  matrix  A.  These 
decompositions  can  be  represented  in  a sparse  way  and  are  in  general 
easy  to  construct  and  cheap  to  compute.  If  the  matrix  A is  positive 
definite,  then  the  conjugate  gradients  method  may  be  used  in  connection 
with  these  approximations,  which  in  general  results  in  very  fast 
iterative  methods. 

In this paper a number of approximate decompositions for some
specially structured matrices are demonstrated. From these applications
it follows clearly how to construct decompositions for other types of
matrices. The convergence properties of the different methods are also
treated, and computational aspects are discussed briefly.
2. THE ITERATIVE PROCEDURE. Unless otherwise stated, it will
be assumed throughout this paper that the matrix $A = (a_{ij})$ is a symmetric
M-matrix. A matrix A is an M-matrix if $a_{ij} \le 0$ for $i \ne j$, $A^{-1}$
exists and $A^{-1} \ge 0$ (Varga [5,15]).

The linear system (1) might be solved by a decomposition method for
the matrix A. In general the decomposition factors are far less sparse
than A, and therefore it appears to be attractive to look for
approximate decompositions where the factors are sufficiently sparse.

The existence of such sparse factorizations is guaranteed by a
theorem given in [14]. That theorem states that a sparsity pattern
for the factor L (in $A = LL^T + R$) can be chosen in advance. Since $K = LL^T$
is a positive definite symmetric matrix, this approximate decomposition
can be used in combination with the conjugate gradients method.

The resulting iterative scheme can be written as:

       $x_0$ is an arbitrary initial approximation to x,
       $r_0 = b - A x_0$,
       $p_0 = K^{-1} r_0$,

(2)    $\alpha_i = \frac{(r_i, K^{-1} r_i)}{(p_i, A p_i)}$

       $x_{i+1} = x_i + \alpha_i p_i$

       $r_{i+1} = r_i - \alpha_i A p_i$

       $\beta_i = \frac{(r_{i+1}, K^{-1} r_{i+1})}{(r_i, K^{-1} r_i)}$

       $p_{i+1} = K^{-1} r_{i+1} + \beta_i p_i$

       $i = 0, 1, 2, \ldots$

It should be noted that $K^{-1}$ will not be determined explicitly; vectors
$y = K^{-1} z$ will be determined by solving $LL^T y = z$ in two steps. For details
see [14].
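Scheme (2) can be sketched as follows. This is a minimal illustration in plain Python, not the author's code: the names `preconditioned_cg`, `matvec` and `solve_K`, the residual-norm stopping test and its tolerance are assumptions. The caller supplies `solve_K`, which returns $K^{-1} r$ (for example by the two triangular solves mentioned above), so K is never formed.

```python
def dot(u, v):
    """Euclidean inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))


def preconditioned_cg(matvec, solve_K, b, x0, tol=1e-10, max_iter=1000):
    """Scheme (2): matvec(v) returns A v, solve_K(r) returns K^{-1} r.

    Returns the approximate solution and the number of iterations used.
    """
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(x))]    # r_0 = b - A x_0
    z = solve_K(r)                                    # K^{-1} r_0
    p = list(z)                                       # p_0 = K^{-1} r_0
    rz = dot(r, z)                                    # (r_i, K^{-1} r_i)
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:                   # assumed stopping test
            return x, iteration
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = solve_K(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

Passing `solve_K = lambda r: list(r)` gives "pure" conjugate gradients (K = I); dividing each component by the corresponding diagonal element of A gives the choice K = DIAG(A).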


3. TWO-DIMENSIONAL PROBLEMS. In this section we assume the matrix
to arise from 5-point finite difference discretisation of

(3)    $-(\alpha u'_x)'_x - (\alpha u'_y)'_y + \gamma u = \delta$

       $\alpha(x,y) > 0$, $\gamma(x,y) \ge 0$ over R,

and suitable boundary conditions along $\delta R$.

For R we take a rectangular region in the (x,y)-plane.

This leads to matrices with the desired properties (see section 1).

The elements of the upper triangular part of A are denoted by $a_i$, $b_i$ and
$c_i$, where i is counted rowwise. The order of the matrix is given
by N.

3.1. K=DIAG(A). The simplest approximation K of A arises when
we choose all off-diagonal elements of L (in $K = LL^T$) to be zero.

Obviously K is then a diagonal matrix whose diagonal elements are
equal to $a_i$ (the diagonal elements of A). The hybrid conjugate gradient
method that results from this choice is equivalent to the conjugate
gradients algorithm applied directly to the scaled matrix A, where the
diagonal elements of A are used as scaling factors. This scaling is
in some sense optimal for the conjugate gradients method since it
minimises, among all scalings, the condition number of A [16].

3.2. ICCG(0). A more efficient method results if we choose the
matrix K in such a way that its factors are as sparse as A itself. It
is convenient in this case to write the approximation as $K = L_0 D_0 L_0^T$, where
$L_0^T$ is an upper triangular matrix and $D_0$ is a diagonal matrix. In this
way square roots are avoided, but the main advantage is the property
that this factorization can be represented by only one diagonal.

We have

(4)    $A = L_0 D_0 L_0^T + R_0$    $(K = L_0 D_0 L_0^T)$

If we choose $D_0$ to be the inverse of the diagonal of $L_0^T$, then this
approximate factorization $L_0 D_0 L_0^T$ of A is uniquely determined by the
requirement that $R_0$ has zero entries where A has nonzero entries. The
elements of $L_0^T$ are denoted, analogously to A, by $\tilde a_i$, $\tilde b_i$ and $\tilde c_i$, those of
$D_0$ by $\tilde d_i$.


The following relations are easily verified:

(5)    $\tilde a_i = \tilde d_i^{-1} = a_i - \tilde b_{i-1}^2 \tilde d_{i-1} - \tilde c_{i-m}^2 \tilde d_{i-m}$

for i=1,...,N, where m is the half band-width of A. Undefined
elements should be replaced by zeros.

From the relations (5) it follows that only the elements $\tilde d_i$ need to be
stored.

The resulting hybrid conjugate gradients method is called ICCG(0) [14].

The computational efforts for this iterative method are roughly
proportional to 16N multiplications per iteration.
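The recurrence (5) is short enough to try out directly. The sketch below is illustrative only: the function names are hypothetical, indices are 0-based, and it assumes $\tilde b_i = b_i$ and $\tilde c_i = c_i$ (the off-diagonal elements of $L_0^T$ equal those of A, as the 3-dimensional relations (16) state for their analogue). The helper `max_pattern_mismatch` forms $K = L_0 D_0 L_0^T$ densely, for checking on a small grid only, and measures how far K is from A on the non-zero pattern of A.

```python
def iccg0_diagonal(a, b, c, m):
    """Relation (5): d[i] = 1 / (a[i] - b[i-1]^2 d[i-1] - c[i-m]^2 d[i-m]).

    Terms with an undefined (negative) index are taken as zero.
    """
    n = len(a)
    d = [0.0] * n
    for i in range(n):
        pivot = a[i]
        if i >= 1:
            pivot -= b[i - 1] ** 2 * d[i - 1]
        if i >= m:
            pivot -= c[i - m] ** 2 * d[i - m]
        d[i] = 1.0 / pivot
    return d


def five_point_laplacian(m, rows):
    """Diagonals a, b, c of the 5-point Laplacian on an m-by-rows grid."""
    n = m * rows
    a = [4.0] * n
    b = [-1.0 if (i + 1) % m != 0 else 0.0 for i in range(n)]   # A[i, i+1]
    c = [-1.0 if i + m < n else 0.0 for i in range(n)]          # A[i, i+m]
    return a, b, c


def max_pattern_mismatch(a, b, c, d, m):
    """Largest |K - A| on the non-zero pattern of A, with K = L D L^T."""
    n = len(a)
    U = [[0.0] * n for _ in range(n)]    # dense L^T, for checking only
    for i in range(n):
        U[i][i] = 1.0 / d[i]             # diagonal of L^T is the inverse of D
        if i + 1 < n:
            U[i][i + 1] = b[i]
        if i + m < n:
            U[i][i + m] = c[i]

    def K(p, q):
        return sum(U[k][p] * d[k] * U[k][q] for k in range(n))

    worst = 0.0
    for i in range(n):
        worst = max(worst, abs(K(i, i) - a[i]))
        if i + 1 < n:
            worst = max(worst, abs(K(i, i + 1) - b[i]))
        if i + m < n:
            worst = max(worst, abs(K(i, i + m) - c[i]))
    return worst
```

For example, `a, b, c = five_point_laplacian(4, 3)` followed by `d = iccg0_diagonal(a, b, c, 4)` yields positive pivots, and `max_pattern_mismatch(a, b, c, d, 4)` is at rounding level: the remainder $R_0$ is non-zero only outside the pattern of A.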

3.3. ICCG(1). The next approximation is defined by

(6)    $A = L_1 D_1 L_1^T + R_1$    $(K_1 = L_1 D_1 L_1^T)$

and $R_1$ is required to have zeros where $A - R_0$ (see 3.2) has nonzeros.
$D_1$ is a diagonal matrix whose elements are equal to the inverses of the
diagonal elements of $L_1^T$. From this it follows that $L_1^T$ has one extra non-zero
diagonal.




The elements are defined by:

       $\tilde a_i = \tilde d_i^{-1} = a_i - \tilde b_{i-1}^2 \tilde d_{i-1} - \tilde f_{i-m+1}^2 \tilde d_{i-m+1} - \tilde c_{i-m}^2 \tilde d_{i-m}$

(7)    $\tilde b_i = b_i - \tilde c_{i-m+1} \tilde f_{i-m+1} \tilde d_{i-m+1}$

       $\tilde f_i = -\tilde c_{i-1} \tilde b_{i-1} \tilde d_{i-1}$

for i=1,2,...,N, where again m is the half band-width of A, and undefined
elements should be replaced by zeros.

Storage now requires 3 extra vectors of length N, and the computational
efforts of the resulting hybrid conjugate gradients method are
proportional to 18N multiplications per iteration.

3.4. ICCG(3). If we proceed in this manner, then the next
approximation is characterized by:

(8)    $A = L_3 D_3 L_3^T + R_3$    $(K = L_3 D_3 L_3^T)$

This approximation follows from the requirement that $R_3$ has zeros
where $A - R_0 - R_1$ has non-zero elements, and from the choice of $D_3$ to be a
diagonal matrix equal to the inverse of the diagonal of $L_3^T$.

The subscripts 3 have been used, since it appears that $L_3^T$ has three
extra non-zero diagonals as compared to the upper triangular part of A:
row i of $L_3^T$ contains the elements

       $\tilde a_i$, $\tilde b_i$, $\tilde g_i$, ..., $\tilde e_i$, $\tilde f_i$, $\tilde c_i$.


The elements follow from the following relations:

       $\tilde d_i^{-1} = \tilde a_i = a_i - \tilde b_{i-1}^2 \tilde d_{i-1} - \tilde g_{i-2}^2 \tilde d_{i-2} - \tilde e_{i-m+2}^2 \tilde d_{i-m+2} - \tilde f_{i-m+1}^2 \tilde d_{i-m+1} - \tilde c_{i-m}^2 \tilde d_{i-m}$

       $\tilde b_i = b_i - \tilde g_{i-1} \tilde b_{i-1} \tilde d_{i-1} - \tilde c_{i-m+1} \tilde f_{i-m+1} \tilde d_{i-m+1} - \tilde f_{i-m+2} \tilde e_{i-m+2} \tilde d_{i-m+2}$

       $\tilde e_i = -(\tilde c_{i-2} \tilde g_{i-2} \tilde d_{i-2} + \tilde f_{i-1} \tilde b_{i-1} \tilde d_{i-1})$

       $\tilde f_i = -\tilde c_{i-1} \tilde b_{i-1} \tilde d_{i-1}$

       $\tilde g_i = -\tilde c_{i-m+2} \tilde e_{i-m+2} \tilde d_{i-m+2}$

       $\tilde c_i = c_i$

for i=1,2,...,N, and m the half band-width of A.

An extra storage of 5 vectors of length N is required to store $L_3^T$ and
the computational efforts of the resulting hybrid conjugate gradients
method are proportional to 22N multiplications per iteration.


3.5. HOW FAR? The process of constructing better approximations can
of course be continued until we have a complete factorization. Evidently
there is, depending on the problem, some optimum choice
somewhere between the sparsest approximation and the complete
factorization. The factorization next to the one in section 3.4 would
have been

       $A = L_7 D_7 L_7^T + R_7$

where $L_7^T$ has 7 extra non-zero diagonals as compared to the equivalent
part of A. The representation would take 9 extra vectors of length N
and the computational costs would be proportional to 30N multiplications
per iteration.

Experiments  show  that,  compared  to  "pure"  conjugate  gradients  (i.e.  K=I) 
for  the  solution  of  Ax=b,  the  biggest  improvement  is  made  by  ICCG(O), 
and  in  most  cases  the  "optimum",  with  respect  to  computational  costs, 
is  reached  by  ICCG(3). 


4. EIGENVALUES OF $K^{-1}A$. For a better understanding of these
methods it is instructive to see how the eigenvalues of $K^{-1}A$ behave.

From straightforward computations it follows that the iterative processes
based on approximate decompositions, in combination with the conjugate
gradients algorithm for the solution of Ax=b, are equivalent to
conjugate gradients applied to the linear system

(11)    $D^{-1/2} L^{-1} A L^{-T} D^{-1/2} y = D^{-1/2} L^{-1} b$,    $x = L^{-T} D^{-1/2} y$,

where $K = LDL^T$.

For conjugate gradients we have that

(12)    $\frac{\|x_i - x\|_A}{\|x_0 - x\|_A} \le \sigma^i$

where

        $\|x_i - x\|_A^2 = (x_i - x, A(x_i - x))$,

(13)    $\sigma = \frac{\sqrt{c} - 1}{\sqrt{c} + 1}$,    $c = \frac{\lambda_{max}}{\lambda_{min}}$,

and $\lambda_{max}$, $\lambda_{min}$ are the largest and smallest eigenvalues of $K^{-1}A$.

However, this upper bound $\sigma$ for the rate of convergence appears to be
far too pessimistic in general, since the influence of components in
the starting vector $x_0$, in directions of eigenvectors, damps out
soon.

On the other hand, it is well-known that the conjugate gradients
algorithm takes full advantage of relative clustering of the eigenvalues
[17]. This is the reason that the new algorithms are so successful,
because premultiplication with the approximate inverse $K^{-1}$ of the matrix A
forces most of the eigenvalues of $K^{-1}A$ towards 1.

This will be demonstrated by an example. Consider the linear system
arising from discretisation of $\Delta u = 0$ over the unit square (the sides
of the square are numbered (1) to (4)).

The boundary conditions are: $\frac{\partial u}{\partial n} = 0$ along (2), (3) and (4), and u is a
given function along (1).

A grid is chosen with meshwidth 1/31 in each direction, which results in
a linear system with 992 unknowns (N=992, m=31).

In order to demonstrate the effects on the eigenvalues, depending on the
choice of approximation, we listed in table 1 the five smallest and the
five largest eigenvalues of A and of $K^{-1}A$ for ICCG(0), ICCG(1) and
ICCG(3).
  nr    eigenvalue
        A         ICCG(0)   ICCG(1)   ICCG(3)

    1   0.0025    0.0045    0.0119    0.0252
    2   0.0121    0.0219    0.0589    0.1241
    3   0.0223    0.0391    0.0983    0.1900
    4   0.0319    0.0561    0.1428    0.2740
    5   0.0409    0.0712    0.1840    0.3614
  988   7.904     1.2106    1.1815    1.1583
  989   7.922     1.2108    1.1861    1.1595
  990   7.951     1.2119    1.1885    1.1666
  991   7.952     1.2120    1.1888    1.1747
  992   7.980     1.2324    1.1934    1.1794

                    Table 1.
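As a small worked example, the extreme eigenvalues in Table 1 can be inserted in (13) to see how each approximation changes the bound $\sigma$. The sketch below is only an illustration; the names `EXTREMES`, `SIGMA` and `convergence_factor` are hypothetical, and the numbers are the first and last rows of Table 1.

```python
import math

# Smallest and largest eigenvalues (rows 1 and 992 of Table 1).
EXTREMES = {
    "A": (0.0025, 7.980),
    "ICCG(0)": (0.0045, 1.2324),
    "ICCG(1)": (0.0119, 1.1934),
    "ICCG(3)": (0.0252, 1.1794),
}


def convergence_factor(lam_min, lam_max):
    """sigma = (sqrt(c) - 1) / (sqrt(c) + 1) with c = lam_max / lam_min, as in (13)."""
    root_c = math.sqrt(lam_max / lam_min)
    return (root_c - 1.0) / (root_c + 1.0)


SIGMA = {name: convergence_factor(lo, hi) for name, (lo, hi) in EXTREMES.items()}

for name, sigma in SIGMA.items():
    lo, hi = EXTREMES[name]
    print(f"{name:8s} c = {hi / lo:8.1f}   sigma = {sigma:.4f}")
```

The ratio c drops from roughly 3200 for A to below 50 for ICCG(3), and $\sigma$ decreases accordingly. This bound uses only the two extreme eigenvalues, so it does not reflect the clustering of the remaining eigenvalues near 1.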


In table 2 the percentage of eigenvalues that are in the
interval [0.85, 1.15] is given.


  matrix              percentage in [0.85, 1.15]

  A                   ~ 40 %
  ICCG(0)-matrix        74 %
  ICCG(1)-matrix        92 %
  ICCG(3)-matrix        97 %

                    Table 2.


5. 3-DIMENSIONAL PROBLEMS. In this section, two approximate
decompositions will be given for symmetric linear systems that arise
from discretisation of 3-dimensional partial differential equations

(14)    $-(\alpha u'_x)'_x - (\alpha u'_y)'_y - (\alpha u'_z)'_z + \gamma u = \delta$

over a rectangular region, and with suitable boundary conditions.

The N-th order matrix A of the resulting linear system Ax=b has the
following form, where N=n*m*k (n is the number of grid points in
the x-direction, m the number in the y-direction and k the number in
the z-direction). The non-zero diagonals of the upper triangular part
of the symmetric matrix A are denoted by a, b, c and d respectively;
subscripts are counted rowwise.


5.1. ICCG(0) - 3 DIMENSIONAL. Our first approximate factorization
is defined by

(15)    $A = L_0 D_0 L_0^T + R_0$    $(K = L_0 D_0 L_0^T)$

where $L_0$ and $L_0^T$ are as sparse as A, and $R_0$ has zero elements in places
where A has non-zero elements. The elements of the diagonal matrix $D_0$
are denoted by $\Delta_i$, the elements of $L_0^T$ by $\tilde a_i$, $\tilde b_i$, $\tilde c_i$ and $\tilde d_i$.

The values of these elements can be computed from the following relations:

        $\Delta_i^{-1} = \tilde a_i = a_i - \tilde b_{i-1}^2 \Delta_{i-1} - \tilde c_{i-n}^2 \Delta_{i-n} - \tilde d_{i-n*m}^2 \Delta_{i-n*m}$

(16)    $\tilde b_i = b_i$

        $\tilde c_i = c_i$

        $\tilde d_i = d_i$

for i=1,2,...,N.

Only one vector of length N is required to store this factorization.
The resulting hybrid conjugate gradients method requires computational
efforts proportional to 20N multiplications per iteration.
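The first relation of (16) has the same shape as its 2-dimensional counterpart with one more band, so it can be written once for any set of band offsets. The following sketch is an illustration under stated assumptions: the function names are hypothetical, indices are 0-based, and the test matrix is the 7-point Laplacian with n, m >= 2 so that the offsets 1, n and n*m are distinct.

```python
def incomplete_pivots(a, bands):
    """Pivots Delta_i with 1/Delta_i = a_i - sum over bands of band[i-k]^2 * Delta[i-k].

    `bands` maps an offset k to the list of entries A[i, i+k]; terms with a
    negative index are taken as zero.
    """
    n = len(a)
    delta = [0.0] * n
    for i in range(n):
        pivot = a[i]
        for offset, values in bands.items():
            j = i - offset
            if j >= 0:
                pivot -= values[j] ** 2 * delta[j]
        delta[i] = 1.0 / pivot
    return delta


def seven_point_laplacian(n, m, k):
    """Main diagonal and bands b, c, d of the 7-point Laplacian on an n*m*k grid."""
    size = n * m * k
    a = [6.0] * size
    b = [-1.0 if (i + 1) % n != 0 else 0.0 for i in range(size)]        # x-neighbour
    c = [-1.0 if (i // n) % m != m - 1 else 0.0 for i in range(size)]   # y-neighbour
    d = [-1.0 if i + n * m < size else 0.0 for i in range(size)]        # z-neighbour
    return a, {1: b, n: c, n * m: d}
```

With `a, bands = seven_point_laplacian(3, 3, 2)`, the call `incomplete_pivots(a, bands)` returns 18 positive pivots, and this single vector is all that has to be stored for the factorization.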


5.2. ICCG(3) - 3 DIMENSIONAL. The next (and better) approximation
arises when an approximate decomposition $L_3 D_3 L_3^T$ is constructed, where $L_3$
and $L_3^T$ have non-zero elements in places where $A - R_0$ (section 5.1) has
non-zero elements.

This decomposition is defined by

(17)    $A = L_3 D_3 L_3^T + R_3$    $(K = L_3 D_3 L_3^T)$

where $R_3$ is required to have zero elements in places where $A - R_0$ has
non-zero elements, and $D_3$ is a diagonal matrix equal to the inverse of
the diagonal of $L_3^T$. It then appears that $L_3^T$ has three extra non-zero
diagonals as compared to $L_0^T$.

The elements of $L_3^T$ and $D_3$ follow from:

        $\tilde a_i = \Delta_i^{-1} = a_i - \tilde b_{i-1}^2 \Delta_{i-1} - \tilde e_{i-n+1}^2 \Delta_{i-n+1} - \tilde c_{i-n}^2 \Delta_{i-n}$
                $- \tilde f_{i-n*m+n}^2 \Delta_{i-n*m+n} - \tilde g_{i-n*m+1}^2 \Delta_{i-n*m+1} - \tilde d_{i-n*m}^2 \Delta_{i-n*m}$

        $\tilde b_i = b_i - \tilde c_{i-n+1} \Delta_{i-n+1} \tilde e_{i-n+1} - \tilde d_{i-n*m+1} \Delta_{i-n*m+1} \tilde g_{i-n*m+1}$

(18)    $\tilde e_i = -\tilde c_{i-1} \Delta_{i-1} \tilde b_{i-1} - \tilde g_{i-n*m+n} \Delta_{i-n*m+n} \tilde f_{i-n*m+n}$

        $\tilde c_i = c_i - \tilde d_{i-n*m+n} \Delta_{i-n*m+n} \tilde f_{i-n*m+n}$

        $\tilde f_i = -\tilde g_{i-n+1} \Delta_{i-n+1} \tilde e_{i-n+1} - \tilde d_{i-n} \Delta_{i-n} \tilde c_{i-n}$

        $\tilde g_i = -\tilde d_{i-1} \Delta_{i-1} \tilde b_{i-1}$

        $\tilde d_i = d_i$

Six vectors of length N are required to store the non-zero diagonals of
$L_3^T$. The resulting hybrid conjugate gradients method requires computational
efforts proportional to 26N multiplications per iteration.

It has been observed that in general ICCG(3) - 3D gives a remarkable
improvement over ICCG(0) - 3D, and it should be mentioned here that leaving
out one or more of the diagonals e, f and g in general results in an
iterative method which is only a little faster than ICCG(0) - 3D.

6. ADDITIONAL REMARKS. From the examples given in the previous
sections, it is easy to see how to handle matrices with different structures.
In particular, the variants where the factorizations are as sparse as the
matrix in question are extremely simple to compute.
All those approximations cause no trouble as long as the original
matrix is an M-matrix. It should be stressed here that positive
definiteness is not enough. However, some statements can be made about
symmetric positive definite sparse matrices. We have had good experience
with the following strategy.

During the process of constructing an approximate Choleski factorization
($A = LL^T + R$), the size of the diagonal elements in the factors is monitored.
As soon as a small element is encountered, two different strategies can
be followed, the second of which is the better in our experience.

In the first one, a constant is added to this diagonal element and the
process is continued.

For the second strategy, one should realize that a diagonal element is
computed by subtracting a sum of squares of already computed off-diagonal
elements from an original diagonal element of the matrix. Occurrence of
small diagonal elements can now be prevented by replacing one or more of
these off-diagonal elements by zeros. This makes the decomposition
process locally less accurate, but in general small elements occur only
a few times and the resulting hybrid conjugate gradients algorithm yields
satisfactory results.

ACKNOWLEDGEMENTS . Part  of  the  research  reported  in  this  paper, 
as  well  as  the  presentation  of  this  paper  at  the  Conference,  has  been 
made  possible  by  the  European  Research  Office,  through  GRANT  DA  ERO  - 
75  - G 084. 




REFERENCES


1.  Alan George, "Nested dissection of a regular finite element mesh",
    SIAM J. Numer. Anal., 10 (1973), pp 345-363.

2.  Alan George, "Solution of linear system equations: direct methods
    for finite element problems",
    in: V.A. Barker, ed., "Sparse Matrix Techniques", Springer Verlag,
    Berlin, 1977.

3.  B.L. Buzbee, G.H. Golub and C.W. Nielson, "On direct methods for solving
    Poisson's equations", SIAM J. Numer. Anal., 7 (1970), pp 627-656.

4.  B.L. Buzbee, F.W. Dorr, J.A. George and G.H. Golub, "The direct solution
    of the discrete Poisson equation on irregular regions", SIAM J.
    Numer. Anal., 8 (1971), pp 722-736.

5.  R.S. Varga, "Matrix Iterative Analysis", Prentice-Hall, Englewood
    Cliffs, N.J., 1962.

6.  D.M. Young, "Iterative solution of large linear systems", Academic
    Press, New York, 1971.

7.  P. Concus and G.H. Golub, "Use of fast direct methods for the efficient
    numerical solution of nonseparable elliptic equations", SIAM J.
    Numer. Anal., 10 (1973), pp 1103-1120.

8.  P. Concus, G.H. Golub and D.P. O'Leary, "A generalized conjugate gradient
    method for the numerical solution of elliptic partial differential
    equations",
    in: J.R. Bunch and D.J. Rose, ed., "Sparse Matrix Computations",
    Academic Press, New York, 1976.

9.  D.P. O'Leary, "Hybrid conjugate gradient algorithms", Ph.D. Thesis,
    Comp. Science Dept., Stanford Univ., 1975.

10. O. Axelsson, "A generalized SSOR method", BIT 13 (1972), pp 443-467.

11. O. Axelsson, "A class of iterative methods for finite element
    equations", Dept. of Comp. Science, Goteborg Univ., 1975.

12. J.R. Bunch and D.J. Rose, ed., "Sparse Matrix Computations",
    Academic Press, New York, 1976.

13. V.A. Barker, ed., "Sparse Matrix Techniques", Springer Verlag,
    Berlin, 1977.

14. J.A. Meijerink and H.A. van der Vorst, "An iterative solution method
    for linear systems of which the coefficient matrix is a symmetric
    M-matrix", Math. of Comp., 31 (1977), pp 148-162.

15. R.S. Varga, "M-matrix theory and recent results in numerical linear
    algebra",
    in: J.R. Bunch and D.J. Rose, ed., "Sparse Matrix Computations", Academic
    Press, New York, 1976.

16. A. van der Sluis, "Condition numbers and equilibration of matrices",
    Numer. Math. 14 (1969), pp 14-23.

17. O. Axelsson, "Solution of linear systems of equations: iterative
    methods",
    in: V.A. Barker, ed., "Sparse Matrix Techniques", Springer Verlag,
    Berlin, 1977.




COMMENTS  ON  THE  SOLUTION  OF  COUPLED  STIFF  DIFFERENTIAL  EQUATIONS 


M.  D.  Kregel 
J.  M.  Heimerl 

U.S.  Army  Ballistic  Research  Laboratory 
Aberdeen  Proving  Ground,  Maryland  21005 

ABSTRACT. The K-method of integrating sets of ordinary differential
equations is outlined. This algorithm employs a variable step
size, third order predictor-corrector method. It is written to reduce
truncation error and to maximize stability consistent with reasonable
execution time. The K-method has been used to integrate equations
with discontinuous driving functions and this algorithm conserves
chemical balance within the machine round-off error. It has been
applied to solve kinetics problems in aeronomy and examples are taken
from that field.

INTRODUCTION. In the late 1960's the aeronomy branch at the BRL
needed the solutions to sets of stiff ordinary differential equations
(ODEs) that describe the positive and negative ion chemistry in the
earth's D-region (~ 60-85 km). Adequate mathematical techniques for
handling stiff ODEs were unknown to us at that time. Kregel, a physicist,
approached this problem empirically and developed a stiff ODE integrator.
This report sketches the K-method of integration and provides a few
examples of the uses of this integrator. Our aim is two-fold: to interest
mathematicians so that this method may be placed on a firmer mathematical
foundation and to inform potential users so that they may apply this
algorithm to their particular problem.

TWO BASIC PROBLEMS. Before we discuss the sketch and the examples
let us briefly review two problems that are basic to any numerical
solution to ODEs, stiff or not. These problems are truncation error and
stability. According to Dahlquist and Björck,^1 "... (truncation errors)
are committed when a limiting process is broken off before one has come
to the limiting value." Truncation errors result from mathematical
approximations. For example they arise when a finite series approximates
an infinite series or when a linear function approximates a nonlinear
one.

Stability, or its better known opposite, instability, is associated
with the idea of feedback.^2 As the name implies, part of a program or
code has a loop in which the numbers produced at the output of one cycle
are used as the input for the next cycle. The errors associated with these
numbers may then be amplified in such a way as to destroy the solution.

^1 Numerical Methods, by G. Dahlquist and A. Björck, Trans. by N.
Anderson, 1974, Prentice Hall, Inc., Englewood Cliffs, NJ, p. 22.

^2 See for example Numerical Methods for Scientists and Engineers, by
R. W. Hamming, 2nd Edition, 1973, McGraw-Hill, Inc., p. 5.
The purpose in recalling these nemeses is to justify the effort
that has gone into the K-method algorithm to reduce truncation error and
to maximize stability consistent with a reasonable execution time.

SKETCH OF THE K-METHOD. Figure 1 schematically shows the main
functional steps in the K-method algorithm.^3 A third-order predictor-
corrector method is employed. As soon as the initial corrector is
formed, the diagonal of the Jacobian (i.e. $\partial y_i' / \partial y_i$) is examined to
determine those dependent variables that are stiff and those that are
not (not shown in Fig. 1). This is done in order to select the method
with the least computational overhead for updating each of the initial
predictor values.

At this stage of the algorithm neither the predictor values nor the
corrector values are presumed acceptable and an error vector is generated
from their difference. This vector, in conjunction with the Jacobian
including, now, the off-diagonal elements, is used to modify the predictor
values. The modification to the predicted values, which involves
a matrix inversion, is first attempted iteratively using a Gauss-Seidel
method. During the iteration, checks are made on the estimated computational
overhead burden. Should the iteration method prove too tedious,
the remaining "non-converged" correction elements are solved by direct
matrix inversion. The corrector is recomputed and another error vector
is generated.

Truncation error is then checked by comparing the fourth derivative
of each dependent variable against a predetermined relative error tolerance.
If any one of these variables fails this test, the truncation
error is judged "poor", the step size, h, is reduced by a factor of two
and this cycle of the computation is begun anew. This test is especially
useful, as we shall see later, in the case of discontinuous driving
functions.

When all the dependent variables have passed the truncation test
(i.e., are "OK" in Fig. 1), a check is made on the predictor-corrector
agreement. Should all elements of the predictor-corrector difference
vector be less than a predetermined minimum error (i.e. are "OK"), the
corrector values are accepted as the solutions at this time step and the
step size is adjusted for the next cycle. If any element of this difference
vector is greater than a predetermined maximum error tolerance the
agreement is judged "poor," the step size is reduced by a factor of two
and this cycle begun anew. For the intermediate case, in which all
elements of this difference vector are less than the maximum error test,
and in which at least one is greater than the minimum error test, the
difference vector is judged "so-so". When this is the case the predicted
values are again modified and the process begun anew until an acceptable
solution is obtained. It is important to note that the K-method conserves
both charge, if any, and chemical balance to within the round-off error
of the machine since the corrector formulation itself is conservative.

^3 An earlier version has been reported. See "Description and Comparison
of the K-Method for Performing Numerical Integration of Stiff Ordinary
Differential Equations," by M. D. Kregel and E. L. Lortie, BRL Report
No. 1733, July 1974, AD #A00385S.
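The three-way judgement of predictor-corrector agreement can be written as a small decision function. This is only a sketch of the logic described in the text: the function name and the use of absolute differences are assumptions, and a difference exactly equal to one of the two limits (a case the text does not address) is treated here as "so-so".

```python
def judge_agreement(predicted, corrected, min_error, max_error):
    """Classify the predictor-corrector difference vector.

    Returns "poor" if any element exceeds max_error, "ok" if every element
    is below min_error, and "so-so" otherwise.
    """
    diffs = [abs(p - c) for p, c in zip(predicted, corrected)]
    if any(d > max_error for d in diffs):
        return "poor"      # halve the step size and begin the cycle anew
    if all(d < min_error for d in diffs):
        return "ok"        # accept the corrector values, adjust the step size
    return "so-so"         # modify the predicted values and iterate again
```

A driver loop would halve h on "poor", accept the step on "ok", and re-enter the predictor modification on "so-so".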

The  methodology  outlined  in  Figure  1 and  above  shows  that  a solution 
over  one  time  step  is  considered  valid  if  the  corrected  values  map  into 
the  predicted  values  within  a specified  error  tolerance.  Since  small 
differences  in  the  predicted  values  are  magnified  by  a factor  of  the 
order  of  the  stiffness,  we  have  attempted  to  formulate  a predictor- 
corrector  scheme  which  has  minimum  error  and  maximum  stability.  How 
this  has  been  empirically  achieved  is  outlined  below. 

To  be  solved  are  vector  equations  which  may  be  cast  in  the  form  of 

Y'  = F - RY,  (1) 

where  Y is  the  dependent  vector,  and  a function  of  the  independent  variable 
X,  F the  formation  vector  [ = F(Y,X)]  and  R the  removal  vector  f = R(Y,X)]. 
The  predictor  is  chosen  to  be  quadratic  in  form;  i.e.. 


= YQ  + A(Xj  - XQ)  + B(X1  - XQ) 2 , 


(2) 


where $(X_1 - X_0)$ is the local step size and where the subscript zero
denotes the current location. A and B are to be determined. To do this
we require two independent equations. One is obtained from equation (2)
by "looking backward" from the current location, $X_0$, to the time $X_{-1}$.

In this case $Y_1^P$ is replaced by $Y_{-1}$, which has already been evaluated, i.e.,

$Y_{-1} = Y_0 + A(X_{-1} - X_0) + B(X_{-1} - X_0)^2$.  (3)

Obviously another equation could be written for $Y_{-2}$, but an alternate, more
stable and error-free method has been found. Consider the derivative of
equation (2) at $X_1$, namely,


$Y_1^{P\prime} = A + 2B(X_1 - X_0)$,  (4)

and the predictor in the form of equation (1),

$Y_1^{P\prime} = F_1^P - R_1^P Y_1^P$.  (5)


Substituting the definition of $Y_1^P$ from equation (2) into equation (5)
and equating the right hand sides of equation (4) and equation (5) we
have

$A + 2B(X_1 - X_0) = F_1^P - R_1^P\,[\,Y_0 + A(X_1 - X_0) + B(X_1 - X_0)^2\,]$.  (6)

Provided $F_1^P$ and $R_1^P$ are known, equation (6) provides the second equation
required to determine A and B.
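Equations (3) and (6) form a 2x2 linear system in A and B. The following Python sketch shows one way to solve it. It assumes scalar quantities (one component of the vector problem), and the function and argument names are illustrative only, not taken from the K-method code.

```python
from fractions import Fraction


def predictor_coefficients(x_m1, x0, x1, y_m1, y0, f1p, r1p):
    """Solve equations (3) and (6) for A and B (hypothetical helper).

    x_m1, x0, x1 : previous, current and next abscissae
    y_m1, y0     : solution values at x_m1 and x0
    f1p, r1p     : extrapolated formation and removal terms at x1
    """
    hb = x_m1 - x0  # backward offset X_{-1} - X_0 (negative)
    h = x1 - x0     # forward step X_1 - X_0
    # Equation (3): A*hb + B*hb^2 = Y_{-1} - Y_0
    a11, a12, b1 = hb, hb * hb, y_m1 - y0
    # Equation (6): A*(1 + R*h) + B*(2*h + R*h^2) = F - R*Y_0
    a21, a22, b2 = 1 + r1p * h, 2 * h + r1p * h * h, f1p - r1p * y0
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ZeroDivisionError("equations (3) and (6) are linearly dependent")
    # Cramer's rule
    A = (b1 * a22 - a12 * b2) / det
    B = (a11 * b2 - b1 * a21) / det
    return A, B


def predict(x0, x1, y0, A, B):
    """Evaluate the quadratic predictor of equation (2)."""
    h = x1 - x0
    return y0 + A * h + B * h * h
```

With exact rational input the two defining equations can be checked directly: substituting the returned A and B back into (3) and (6) should reproduce both right hand sides.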
Let us now consider how $F_1^P$ and $R_1^P$ are determined. Figure 2 shows
the current and three previous discrete values of the formation element
for a given dependent variable. A parabola of the form

$F^P(X) = C_1 + C_2(X - X_0) + C_3(X - X_0)^2$,  (7)

is fitted to these four points. The $C_j$'s (j=1,2,3) of equation (7)
are found from a least squares fit where the function to be minimized
with respect to the $C_j$ is

$\sum_i W_i\,[F_i - F^P(X_i)]^2$,  $-3 \le i \le 0$.  (8)

The $W_i$ are weighting functions, chosen so that periodic fluctuations in
the $F_i$ values will not propagate into $F_1^P$. (The relative weights are
determined by adding a quantity $a(-1)^i$ to each of the four $F_i$ and requiring
that the identical $F_1^P$ be found.) $R_1^P$ is found in a similar fashion.
Once $F_1^P$ and $R_1^P$ are determined, A and B can be determined, and
consequently, $Y_1^P$.
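The weighted least squares fit of equations (7) and (8) can be sketched as follows. This is only an illustration: the weights passed in the usage note are arbitrary placeholders (the weights actually used by the K-method are not given in the text), and the helper names are hypothetical.

```python
from fractions import Fraction


def solve3(matrix, rhs):
    """Solve a 3x3 linear system exactly by Gauss-Jordan elimination."""
    a = [[Fraction(v) for v in row] + [Fraction(b)]
         for row, b in zip(matrix, rhs)]
    for col in range(3):
        pivot = next(r for r in range(col, 3) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(3):
            if r != col:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [row[3] for row in a]


def fit_parabola(xs, fs, ws, x0):
    """Return [C1, C2, C3] minimising sum_i W_i (F_i - F^P(X_i))^2, cf. (7), (8)."""
    ds = [Fraction(x) - Fraction(x0) for x in xs]
    powers = [[d ** p for d in ds] for p in range(5)]  # d^0 .. d^4
    # Normal equations of the weighted least squares problem
    normal = [[sum(w * pw for w, pw in zip(ws, powers[r + c]))
               for c in range(3)] for r in range(3)]
    rhs = [sum(w * pw * f for w, pw, f in zip(ws, powers[r], fs))
           for r in range(3)]
    return solve3(normal, rhs)


def extrapolate(c, x0, x1):
    """Continue the fitted parabola to X_1, giving F_1^P."""
    d = Fraction(x1) - Fraction(x0)
    return c[0] + c[1] * d + c[2] * d * d
```

For data that lie exactly on a parabola the fit reproduces it for any positive weights, e.g. `fit_parabola([-3, -2, -1, 0], [2, 0, 0, 2], [1, 2, 3, 4], 0)` recovers the coefficients of 2 + 3X + X^2.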

The  corrector  is  given  by 


$Y_1^C = A_{-2}Y_{-2} + A_{-1}Y_{-1} + A_0 Y_0 + (X_1 - X_0)\,(D\,Y_1' + B\,Y_0' + C\,Y_{-1}')$,  (9)


where  D and  B are  preselected  to  minimize  both  truncation  error  and 
noise  amplification,  and  to  maximize  relative  stability.  Values  for  the 
coefficients  in  equation  (9)  can  be  found  in  Reference  3. 

EXAMPLES. We shall now consider three examples of the types of
problems the K-method has been called upon to handle. They are all
drawn from the field of aeronomy. Figure 3 shows a linear plot of a
piecewise continuous driving function [Q(t), solid line]. The
discontinuities are instantaneous since they were formed by reading from a DATA
block with (J) and (J + 1) subscripts interchanged. (The reverse of
each of the slopes in Figure 3 gives the desired driving function, which
is still discontinuous in the first derivative.) The dashed lines are the
response of the electron density and a primary positive ion density, here
the nitric oxide ion, as a function of time. The curves have been
vertically displaced for ease in reading. It is seen that the K-method
enables the dependent variables to follow the input discontinuities of the
driving term. (Departures from a perfect matching of the input slopes can
be explained by chemistry competing with the driving term.) This example
demonstrates the effect of careful monitoring of the truncation error.




The second example, Figure 4, shows histograms of the number of
species that lie in the decade interval, $h/\tau_i$, where h is the local step
size and where $\tau_i$ is the instantaneous characteristic time constant of
the ith species. (The total area under each histogram corresponds to 64
species.) The numbers to the far right are the decade model times in
seconds (i.e., computer execution time runs from bottom to top in this
figure). The histograms are divided into a stiff segment to the right
of the dashed line, and a non-stiff segment to the left. On the first
plot (10 seconds) only a few species are stiff while at the upper
limit of this integration (10 seconds) the number has significantly
increased, with stiffness factors ($h/\tau_i$) greater than 10.

The  last  example  is  shown  in  Figures  5 and  6.  Figure  5 shows  the 
log of the input or driving function, q(e), plotted against the log of
time.  In  the  interval  10  - 10  seconds  those  negatively  charged 

species  more  strongly  coupled  to  the  driving  function,  (e.g.  e . or 
0,  ) follow  the  discontinuities  exhibited  by  the  driving  term.  These 
details tend to be "washed-out" in the species that are weakly coupled
to  the  driving  term  (e.g.  CO,  or  CO^  ).  Near  10  seconds  the  strongly 
coupled species again track the discontinuities in q(e). The dynamic
range  in  the  dependent  variables  is  about  six  orders  of  magnitude. 

Figure 6 shows the broad range response of the neutral species. This
graph  shows  those  species  that  follow  the  driving  term  [e.g.  0(  D) , N(  D)], 
those  that  are  independent  of  the  driving  term  [e.g.  N-,0,  CO]  and  those 
that  tend  to  be  chemistry  dominated  [e.g.  II,,  HNO,,  N~0,_]. 

In summary, we have empirically derived a third order, variable
step size method for efficiently handling stiff ODEs that appears to
work for discontinuous driving functions. We anticipate that, with some
further work, this method will be put on a firmer mathematical
foundation.
Figure 1. Schematic of K-method of integration; h is the step size.

Figure 2. Schematic of a parabolic fit to the values of F at the current and three
previous times for one of the dependent variables. $F_1^P$ is found by continuing
this parabola to the time $X_1$.

Figure 5. Modeled response of the negative ion densities to the
driving function q(e).

Figure 6. Modeled response of the neutral densities to the same
driving function of Fig. 5.


Rolf  Jeltsch 
Institute  of  Mathematics 
Ruhr-University Bochum
D-4630 Bochum, Germany


ABSTRACT. Brown [1] introduced k-step methods using $\ell$ derivatives.
Necessary and sufficient conditions for $A_0$-stability and stiff stability
of these methods are given. These conditions are used to investigate for
which k and $\ell$ the methods are $A_0$-stable. It is seen that for all k
and $\ell$ with $k \le 1.5(\ell + 1)$ the methods are $A_0$-stable and stiffly
stable. This result is conservative and can be improved for $\ell$ sufficiently
large. For small k and $\ell$, $A_0$-stability has been determined numerically
by implementing the necessary and sufficient condition.


1. INTRODUCTION. To solve numerically the initial value problem

(1)  $y'(x) = f(x, y(x))$,  $y(a) = \eta$

we shall need the higher derivatives $f^{(0)}(x,y) = f(x,y)$ and

$f^{(j)}(x,y) = \frac{\partial}{\partial x} f^{(j-1)}(x,y) + \frac{\partial}{\partial y} f^{(j-1)}(x,y)\, f(x,y)$,  j = 1,2,... .

Let h denote the step size, $x_m = a + mh$, and $y_m$ be approximations to the
exact solution $y(x_m)$ of (1). Moreover let $f_m^{(j)} = f^{(j)}(x_m, y_m)$. Brown [1]
has introduced methods of the form

(2)  $\sum_{i=0}^{k} \alpha_i y_{n+i} = \sum_{j=1}^{\ell} h^j \beta_j f_{n+k}^{(j-1)}$  with  $\sum_{i=0}^{k} |\alpha_i| > 0$,  $\sum_{j=1}^{\ell} |\beta_j| > 0$,

n = 0,1,2,... . Here $\alpha_i$ and $\beta_j$ are constants such that the method has
highest possible error order. In Jeltsch, Kratz [13] it was shown that

(3a)  $\alpha_i = (-1)^{k-i} \binom{k}{i} (k-i)^{-\ell}$,  i = 0,1,...,k-1

(3b)  $\beta_j = \frac{(-1)^j}{j!} \sum_{i=0}^{k-1} (-1)^{k-i} \binom{k}{i} (k-i)^{j-\ell}$,  j = 0,1,...,$\ell$.

Here we have used the natural convention that $\beta_0 = -\alpha_k$. A method of form (2)
is said to have error order p if

$\sum_{i=0}^{k} \alpha_i y(x+ih) - \sum_{j=1}^{\ell} h^j \beta_j y^{(j)}(x+kh) = C_{p+1} h^{p+1} y^{(p+1)}(x) + O(h^{p+2})$,  $C_{p+1} \ne 0$,

for all sufficiently often differentiable functions y(x). In [13] it was shown
that the methods given by (2) and (3) have the error order $p = k + \ell - 1$ and
that there is no method of form (2) with a higher error order. We shall call the
formula (2) with (3) Brown's (k,$\ell$)-method. It was discussed in [13] for which
k and $\ell$ Brown's (k,$\ell$)-method is stable and for which it is not. Since the
method belongs to the class of k-step methods with $\ell$ derivatives and $p \ge 1$ it
converges if and only if it is stable (see e.g. Griepentrog [8], Spijker [15]).
Convergence is also shown for strongly stable methods in Brown [1], [2].
Assuming strong stability an estimate for the global discretization error as well
as the first term in the asymptotic error expansion are given in [1], [2]. In
the present article we give necessary and sufficient conditions for the methods
of form (2) to be $A_0$-stable. These conditions are then used to investigate for
which k and $\ell$ Brown's (k,$\ell$)-methods are $A_0$-stable, see Fig. 1. For k and
$\ell$ small this has been done by implementing the criteria on the computer. It is
proved that Brown's (k,$\ell$)-methods are $A_0$-stable and stiffly stable if
$k \le 1.5(\ell + 1)$. For any $a < a_1 \approx 3.59112148$ it is $A_0$-stable and stiffly
stable for $k \le a(\ell + 1)$ and $\ell$ sufficiently large. Computer results suggest
that $\ell \ge 2$ is already large enough. Finally we would like to mention that
Brown's (k,1)-methods are the well known Backward Differentiation Methods (BDM).
In Liniger [14] it has been shown that the BDM are $A_0$-stable and strongly stable
for $k \le 5$. Gear [7] showed by plotting the boundary of the region of absolute
stability that the BDM are stiffly stable for $k \le 6$. It is well known, see
Cryer [4], Creedon, Miller [3], that the BDM are unstable for k > 6.

In the next section we derive necessary and sufficient conditions for
$A_0$-stability of methods of form (2). In Section 3 these conditions are used to
derive our main results on Brown's (k,$\ell$)-methods while the more technical proofs
are postponed to the last section.

* This article has been supported by the European Research Office.
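As an illustration of the coefficient formulas quoted from [13], the following Python sketch evaluates them in exact rational arithmetic. It assumes the reading $\alpha_i = (-1)^{k-i}\binom{k}{i}(k-i)^{-\ell}$ for i < k, $\beta_j = \frac{(-1)^j}{j!}\sum_{i<k}(-1)^{k-i}\binom{k}{i}(k-i)^{j-\ell}$ and $\alpha_k = -\beta_0$. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def brown_coefficients(k, l):
    """Return (alpha, beta) of Brown's (k, l)-method as exact fractions.

    alpha has entries i = 0..k, beta has entries j = 0..l, with the
    convention beta[0] = -alpha[k].
    """
    def weight(i, j):
        # (-1)^(k-i) * C(k, i) * (k - i)^(j - l)
        return Fraction((-1) ** (k - i) * comb(k, i)) * Fraction(k - i) ** (j - l)

    alpha = [weight(i, 0) for i in range(k)]
    beta = [Fraction((-1) ** j, factorial(j)) * sum(weight(i, j) for i in range(k))
            for j in range(l + 1)]
    alpha.append(-beta[0])
    return alpha, beta


def order_defect(alpha, beta, m):
    """Residual of the difference operator applied to y(x) = x**m (h = 1, x = 0)."""
    k = len(alpha) - 1
    lhs = sum(a * Fraction(i) ** m for i, a in enumerate(alpha))
    rhs = Fraction(0)
    for j in range(1, len(beta)):
        if j <= m:
            # j-th derivative of x**m evaluated at x = k
            deriv = Fraction(factorial(m), factorial(m - j)) * Fraction(k) ** (m - j)
            rhs += beta[j] * deriv
    return lhs - rhs
```

For $\ell = 1$ this reproduces the backward differentiation formulas; for example `brown_coefficients(2, 1)` gives alpha = [1/2, -2, 3/2] and beta[1] = 1, i.e. the two-step backward differentiation method. The residual `order_defect` vanishes for all monomials of degree up to $k + \ell - 1$, in line with the stated error order.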


2. Necessary and sufficient conditions for $A_0$-stability. Applying the
method (2) to $y' = \lambda y$ leads to the recurrence relation

(3)  $\sum_{i=0}^{k} \alpha_i y_{n+i} - \sum_{j=1}^{\ell} \beta_j (h\lambda)^j y_{n+k} = 0$,  n = 0,1,... .

This is a linear homogeneous recursion relation with constant coefficients and
thus its characteristic equation is

(4)  $\Phi(\zeta,\mu) := \rho(\zeta) - \zeta^k \sigma(\mu) = 0$.

Here we have used the abbreviation $\mu = h\lambda$ and the polynomials

(5)  $\rho(\zeta) = \sum_{i=0}^{k} \alpha_i \zeta^i$,  $\sigma(\mu) = \sum_{j=1}^{\ell} \beta_j \mu^j$.

Let $\zeta_i(\mu)$, i = 1,2,...,k be the k branches of the algebraic function $\zeta(\mu)$
defined by

(6)  $\Phi(\zeta(\mu),\mu) = 0$.

Then every solution of (3) can be written in the form

(7)  $y_n = \sum_{j=1}^{k}{}^{*}\ \pi_j(n)\, \zeta_j^n(\mu)$,

where $\pi_j(n)$ are polynomials in n with constant coefficients whose degree is
less than the multiplicity of $\zeta_j(\mu)$ as a root of $\Phi(\zeta,\mu)$. The * in (7)
indicates that the sum is only taken over those j for which the branch $\zeta_j(\mu)$ is
finite. Clearly $\zeta_j(0)$ are the roots of $\rho(\zeta)$. A method is called stable or
more precisely Dahlquist stable if $|\zeta_i(0)| \le 1$ for i = 1,2,...,k and if
$|\zeta_i(0)| = 1$ then $\zeta_i(0)$ is a simple root of $\rho(\zeta)$, see e.g. Henrici [9],
Stetter [16, p. 206]. A method is convergent if it is stable and the error
order p exceeds 0, Griepentrog [8], Spijker [15]. It is well known that
for a convergent method there is exactly one branch of $\zeta(\mu)$ which becomes 1
at $\mu = 0$. We call this branch the principal branch and denote it by $\zeta_1(\mu)$.
Hence $\zeta_1(0) = 1$ and $\zeta_j(0) \ne 1$ for j = 2,3,...,k. A method is called
strongly stable if

(8)  $|\zeta_i(0)| < 1$,  i = 2,3,...,k.

In the weak stability analysis of methods one is interested in the asymptotic
behaviour of $y_n$ given by (7) as n tends to infinity and h is kept fixed.
Hence one is interested in the region of absolute stability

(9)  $A = \{\mu \in \mathbb{C} \mid |\zeta_i(\mu)| < 1,\ i = 1,2,\dots,k\}$.
According to Cryer [5] a method is called $A_0$-stable if A contains the negative
real axis, i.e. $(-\infty,0) \subset A$. Gear [7] introduced the concept of stiff stability
which we shall use in the rigorous form given in Jeltsch [11], [12]. Since
$\zeta_1(0)$ is a simple root of $\rho(\zeta)$ the principal branch $\zeta_1(\mu)$ of the algebraic
function $\zeta(\mu)$ is analytic in a neighborhood of $\mu = 0$. Let $\Omega$ be the largest
star region into which the principal branch has an analytic continuation. The
set

(10)  $R = \{\mu \in \Omega \mid |\zeta_i(\mu)| < |\zeta_1(\mu)|,\ i = 2,3,\dots,k\}$

is called the region of relative stability. Let D, $\theta$ and a be positive
constants and define the sets $R_1 = \{\mu \in \mathbb{C} \mid \mathrm{Re}\,\mu < -D\}$,
$R_2 = \{\mu \in \mathbb{C} \mid \mathrm{Re}\,\mu \le -a,\ |\mathrm{Im}\,\mu| < \theta\}$ and $R_3 = \{\mu \in \mathbb{C} \mid |\mathrm{Re}\,\mu| < a,\ |\mathrm{Im}\,\mu| < \theta\}$.
A method is called stiffly stable if it is convergent, $A_0$-stable and if for some
positive D, $\theta$ and a one has $R_1 \cup R_2 \subset A$ and $R_3 \subset R$.


THEOREM 1. A method of form (2) with an error order $p \ge 1$ is stiffly stable
if and only if it is $A_0$-stable and strongly stable.
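The definitions above can be illustrated on a small case. The sketch below takes the two-step backward differentiation method, for which $\rho(\zeta) = \frac{3}{2}\zeta^2 - 2\zeta + \frac{1}{2}$ and $\sigma(\mu) = \mu$, and computes the roots of the characteristic equation $\rho(\zeta) - \zeta^2\sigma(\mu) = 0$ for sampled negative real $\mu$. A finite sample is of course only a plausibility check and not a proof of $A_0$-stability; the sample points and function names are arbitrary choices.

```python
import cmath


def bdf2_roots(mu):
    """Roots of (3/2 - mu) zeta^2 - 2 zeta + 1/2 = 0 for the given mu."""
    a, b, c = 1.5 - mu, -2.0, 0.5
    disc = cmath.sqrt(b * b - 4 * a * c)
    return (-b + disc) / (2 * a), (-b - disc) / (2 * a)


def max_modulus(mu):
    """Largest modulus of the characteristic roots at mu."""
    return max(abs(z) for z in bdf2_roots(mu))


# Sample the negative real axis over several decades.
samples = [-(10.0 ** e) for e in range(-3, 7)]
inside = all(max_modulus(mu) < 1.0 for mu in samples)
```

At $\mu = 0$ the roots are 1 (the principal root) and 1/3, so the method is strongly stable; for the sampled $\mu < 0$ both roots lie inside the unit disk.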


This theorem follows immediately from Theorem 3 in Jeltsch [12]. The above
theorem is one important reason why we are interested in finding necessary and
sufficient conditions for $A_0$-stability. In the following we shall derive such
conditions for methods of form (2) which satisfy

(11a)  $(-1)^{j+1} \beta_j > 0$,  j = 0,1,...,$\ell$

(11b)  $\alpha_k = -\beta_0 \ne 0$,  $\beta_\ell \ne 0$

(11c)  $\rho(1) = 0$.

However these methods need not be convergent. We introduce the standard
transformation

(12)  $z = \frac{\zeta+1}{\zeta-1}$,  $\zeta = \frac{z+1}{z-1}$

which maps the unit disk of the $\zeta$-plane onto the left half plane
$H^- = \{z \in \mathbb{C} \mid \mathrm{Re}\, z < 0\}$ of the z-plane. Accordingly we introduce

(13)  $\Psi(z,\mu) := (z-1)^k\, \Phi\!\left(\frac{z+1}{z-1}, \mu\right) = r(z) - \sigma(\mu)(z+1)^k = 0$

where

(14)  $r(z) = (z-1)^k \rho\!\left(\frac{z+1}{z-1}\right) = \sum_{j=0}^{k} a_j z^j$.

Assume that $\rho(\zeta)$ has a root $\zeta = 1$ of exact multiplicity m. Because of (11c)
one has $m \ge 1$. Clearly in most applications one will have m = 1. Since the
transformation (12) maps $\zeta = 1$ into $z = \infty$ the polynomial r(z) is of exact
degree k - m. Let $z(\mu)$ be the algebraic function defined by $\Psi(z(\mu),\mu) = 0$
and $z_i(\mu)$, i = 1,2,...,k, its branches. Clearly a method is $A_0$-stable if
$\mathrm{Re}\, z_i(\mu) < 0$ for all $\mu \in (-\infty,0)$ and all i = 1,2,...,k. Dividing (13) by
$\sigma(\mu)$ one finds immediately that $z_i(-\infty) = -1$ for i = 1,2,...,k. From (11) follows that
$\sigma(\mu) < 0$ for all $\mu \in [-\infty,0)$ and therefore $\Psi(z,\mu)$ is a polynomial in z of
exact degree k for any $\mu \in (-\infty,0)$. Hence $z_i(\mu)$ is always finite for all
$\mu \in (-\infty,0)$ and all i = 1,2,...,k. The branches $z_i(\mu)$ can be defined such that
these are continuous functions of $\mu \in (-\infty,0)$. Hence the method is $A_0$-stable if and only if
none of the graphs $z_i(\mu)$ intersects the imaginary axis for $\mu \in (-\infty,0)$. Hence
we have shown the following


LEMMA 1. Assume a method of form (2) satisfies (11). Then the method is
$A_0$-stable if and only if

(15)  $r(iy) - \sigma(\mu)(iy + 1)^k \ne 0$

for all $\mu \in (-\infty,0)$ and all $y \in \mathbb{R}$.

Clearly Lemma 1 is not a constructive criterion for $A_0$-stability since (15)
has to be tested for infinitely many pairs $\mu$ and y. In order to formulate a
constructive criterion for $A_0$-stability we introduce the polynomials

(16a)  $P(\tau) := -y^{-1}\, \mathrm{Im}\{ r(iy)(-iy+1)^k \}$

and

(16b)  $Q(\tau) := \mathrm{Re}\{ r(iy)(-iy+1)^k \}$

where $\tau := y^2$. $P(\tau)$ and $Q(\tau)$ are real polynomials of degree k - 1. Let
$\tau_1, \tau_2, \dots, \tau_s$ be all non negative roots of $P(\tau)$.

THEOREM 2. Assume a method of form (2) satisfies (11). Then the method is
$A_0$-stable if and only if the following two conditions hold.

$A_1$:  $Q(0) = r(0) = (-1)^k \rho(-1) \ge 0$

$A_2$:  $Q(\tau_j) \ge 0$ for j = 1,2,...,s

Note that a convergent method always satisfies $A_1$. Clearly condition $A_2$ is
satisfied if $P(\tau)$ has no non negative roots at all. This gives the following

COROLLARY 1. Assume a method of form (2) satisfies (11). Then the conditions
$B_1$, $B_2$ are sufficient for $A_0$-stability.

$B_1$:  $r(0) = (-1)^k \rho(-1) \ge 0$

$B_2$:  $P(\tau) \ne 0$ for all $\tau \in [0,\infty)$

REMARK. Condition $B_1$ is easily checked and condition $B_2$ can be checked in
finitely many additions, subtractions, multiplications and divisions using the
Sturm algorithm, see Werner [17, p. 132], Dickson [6, p. 83]. $B_2$ is trivially
satisfied if all coefficients of $P(\tau)$ have the same sign.
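The polynomials r, P and Q can be formed mechanically from the coefficients $\alpha_i$. The Python sketch below does this in exact arithmetic, using the observation that $r(iy)(1-iy)^k = \sum_n i^n d_n y^n$ with real $d_n$, so that Q collects the even and P the odd powers of y. It then tests the simple sufficient condition of the Remark (all coefficients of P positive) together with the sign conditions on $a_0$ and $a_{k-1}$. All function names are hypothetical, and the Sturm algorithm itself is not implemented here.

```python
from fractions import Fraction
from math import comb


def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists (lowest power first)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_pow(p, n):
    """n-th power of a polynomial."""
    out = [Fraction(1)]
    for _ in range(n):
        out = poly_mul(out, p)
    return out


def r_polynomial(alpha):
    """Coefficients a_0..a_k of r(z) = sum_i alpha_i (z+1)^i (z-1)^(k-i)."""
    k = len(alpha) - 1
    r = [Fraction(0)] * (k + 1)
    for i, a in enumerate(alpha):
        term = poly_mul(poly_pow([1, 1], i), poly_pow([-1, 1], k - i))
        for n, c in enumerate(term):
            r[n] += Fraction(a) * c
    return r


def p_and_q(alpha):
    """Coefficients (lowest power of tau first) of P(tau) and Q(tau)."""
    k = len(alpha) - 1
    a = r_polynomial(alpha)
    # r(iy) (1 - iy)^k = sum_n i^n d_n y^n with real d_n
    d = [sum((-1) ** m * a[n - m] * comb(k, m)
             for m in range(max(0, n - k), min(n, k) + 1))
         for n in range(2 * k + 1)]
    q = [(-1) ** (n // 2) * d[n] for n in range(0, 2 * k + 1, 2)]
    p = [-((-1) ** ((n - 1) // 2)) * d[n] for n in range(1, 2 * k + 1, 2)]
    return p, q


def sufficient_conditions(alpha):
    """a_0 > 0, a_{k-1} > 0 and all coefficients of P positive."""
    k = len(alpha) - 1
    a = r_polynomial(alpha)
    p, _ = p_and_q(alpha)
    return a[0] > 0 and a[k - 1] > 0 and all(c > 0 for c in p)
```

For the two-step backward differentiation method, alpha = [1/2, -2, 3/2], one obtains r(z) = 2z + 4, $P(\tau) = 6 + 2\tau$ and $Q(\tau) = 4$, so P has no non negative root.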

PROOF OF THEOREM 2. (15) is equivalent to

(17)  $r(iy)(iy+1)^{-k} \ne \sigma(\mu)$,  $\mu \in (-\infty,0)$,  $y \in \mathbb{R}$.

From (11) follows that $\sigma(\mu)$ is a polynomial which maps $(-\infty,0)$ onto $(-\infty,0)$.
Hence by Lemma 1 and (17) a method is $A_0$-stable if and only if

(18)  $r(iy)(iy+1)^{-k} \notin (-\infty,0)$ for all $y \in \mathbb{R}$.

But by (16) one has

(19)  $Q(y^2) - iy\,P(y^2) = r(iy)(-iy+1)^k = r(iy)(iy+1)^{-k}(1+y^2)^k$.

Hence one has $A_0$-stability if and only if

(20)  $Q(y^2) - iy\,P(y^2) \notin (-\infty,0)$ for all $y \in \mathbb{R}$.

(20) is now obviously equivalent to the conditions $A_1$ and $A_2$.

We would like to recall that in Theorem 2 we did not assume that the method
is stable or has an error order $p \ge 1$. The reason is that through slight
modifications of the conditions $A_j$ one obtains a criterion which gives
$A_0$-stability and strong stability.


THEOREM 3. Assume a method of form (2) satisfies (11) and has an error order
$p \ge 1$. Then the method is $A_0$-stable and strongly stable if and only if the
following three conditions hold.

$C_1$:  $a_0 = r(0) = (-1)^k \rho(-1) > 0$

$C_2$:  $Q(\tau_j) > 0$ for j = 1,2,...,s

$C_3$:  $a_{k-1} > 0$.


PROOF. (I) Necessity. Assume that the method is $A_0$-stable and strongly stable.
Hence $A_1$ and $A_2$ hold. Since $\zeta = -1$ is not a root of $\rho(\zeta)$ one has $C_1$. Strong
stability together with $p \ge 1$ implies that $\zeta = 1$ is a simple root of $\rho(\zeta)$,
that is m = 1. As we have seen earlier this implies $a_{k-1} \ne 0$. However since
r(z) is a polynomial with all roots in the left half plane $H^-$ we have that $a_0$
and $a_{k-1}$ have the same sign. This implies $C_3$. We have already seen that $z_i(\mu)$
is continuous for $\mu \in [-\infty,0)$ for i = 1,2,...,k. Since $a_k = 0$ and $a_{k-1} \ne 0$
one of these roots $z_i(\mu)$, let us denote it by $z_1(\mu)$, will tend to infinity as
$\mu$ tends to $0^-$ while the other ones remain finite. Hence

(21)  $\lim_{\mu \to 0^-} z_i(\mu) = z_i(0)$,  i = 2,3,...,k

are the roots of r(z). Strong stability implies that $\mathrm{Re}\, z_i(0) < 0$. Hence (15)
has to hold for all $\mu \in (-\infty,0]$ and all $y \in \mathbb{R}$. As in the proof of Theorem 2
this implies that

(22)  $Q(y^2) - iy\,P(y^2) \notin (-\infty,0]$ for all $y \in \mathbb{R}$.

From (22) follows immediately $C_2$.

(II) Sufficiency. From $C_1$ and $C_2$ follows by Theorem 2 that the method is
$A_0$-stable. $C_3$ ensures that $\rho(\zeta)$ has a simple root at 1. Hence we find as in
(I) that (21) holds. From $C_1$ and $C_2$ follows that (22) holds and hence
$\mathrm{Re}\, z_i(\mu) < 0$ for all $\mu \in (-\infty,0]$ and i = 2,3,...,k. In particular
$\mathrm{Re}\, z_i(0) < 0$ for i = 2,3,...,k. Hence the method is strongly stable.

REMARK 1. Using Theorem 1 we see that conditions $C_1$-$C_3$ are necessary and
sufficient for a method of form (2) with (11) and $p \ge 1$ to be stiffly stable.

2. Note that a convergent method always satisfies $C_1$ and $C_3$.

Condition $C_2$ is satisfied if $P(\tau)$ has no non negative root. This gives the

COROLLARY 2. Assume a method of form (2) satisfies (11) and has $p \ge 1$. Then
$D_1$-$D_3$ are sufficient for $A_0$-stability, strong stability and stiff stability.

$D_1$:  $a_0 = r(0) > 0$

$D_2$:  $P(\tau) \ne 0$ for all $\tau \in [0,\infty)$

$D_3$:  $a_{k-1} > 0$


As we shall see in the next section Corollary 2 will be very powerful for
the investigation of Brown's (k,$\ell$)-methods. For completeness we give the
following result.

THEOREM 4. Assume a method of form (2) satisfies (11) and has an error
order $p \ge 1$. Then the method is $A_0$-stable and Dahlquist stable if and only if the
conditions $E_1$-$E_4$ hold.

$E_1$:  $a_0 = r(0) > 0$

$E_2$:  $Q(\tau_j) \ge 0$ for j = 1,2,...,s

$E_3$:  $a_{k-1} > 0$

$E_4$:  If $Q(\tau_j) = 0$ then $z = i\,\tau_j^{1/2}$ is a simple root of r(z).

The proof goes along the lines of the proofs of Theorem 2 and 3 and is
therefore left to the reader.

3. $A_0$-stability of Brown's (k,$\ell$)-method. In this section we shall
investigate for which k and $\ell$ Brown's (k,$\ell$)-method is $A_0$-stable. It is clear
that if a method is unstable due to a root of $\rho(\zeta)$ which lies outside the unit
circle then the method is also not $A_0$-stable. Hence from Jeltsch, Kratz [13]
follows that to any fixed $\ell$ Brown's (k,$\ell$)-method is not $A_0$-stable for all k
large enough. However if we fix k then Brown's (k,$\ell$)-method becomes
$A_0$-stable, strongly stable and stiffly stable for all $\ell$ large enough. To show
this let us first show that Brown's (k,$\ell$)-method satisfies (11). In [13] it was
shown that the following relation holds for j = 0,1,...,$\ell$-1:

(23a)  V(-l)k'1',(k)(k-i)j't.  v tj1  , ’V1  >n 

i=o  tj=l  t2-i  Vi  ' i 

For j = $\ell$ one finds through direct calculation

(23b)  $\sum_{i=0}^{k-1} (-1)^{k-1-i} \binom{k}{i} = 1 > 0$.

Using (23) in (3b) one finds that (11a) and (11b) hold. Note that we could have
established (11a) and (11b) also from the fact that Brown's methods are Hermite
interpolatory, see Theorem 9 in [10]. The assumption (11c) is satisfied since
Brown's methods have the error order $p = k + \ell - 1 \ge 1$.


THEOREM 5. Brown's (k,$\ell$)-method is $A_0$-stable and strongly stable whenever

(24)  $k \le \frac{3}{2}(\ell + 1)$.

PROOF. We show this using Corollary 2. From (14) and (3) one obtains

(25)  $r(z) = \sum_{i=0}^{k-1} (-1)^{k-i} \binom{k}{i} (k-i)^{-\ell} \left( (z+1)^i (z-1)^{k-i} - (z+1)^k \right)$.

Hence

(26)  $a_0 = r(0) = \sum_{i=0}^{k-1} \binom{k}{i} (k-i)^{-\ell} \left( 1 - (-1)^{k-i} \right) > 0$

and

(27)  $a_{k-1} = 2 \sum_{i=0}^{k-1} (-1)^{k-1-i} \binom{k}{i} (k-i)^{1-\ell}$.

(26) implies that $D_1$ holds and (27) together with (23) show that $D_3$ holds. In
the remaining part of the proof we shall show that $D_2$ holds whenever (24) is
satisfied. From (25) and (16a) we find through an easy but tedious computation
that

(28)  $P(\tau) = 2 \sum_{j=0}^{k-1} (-1)^{k-1-j} \binom{k}{j} (k-j)^{-\ell} (1+\tau)^j \sum_{t=0}^{t_j} \binom{k-j}{2t+1} (\tau-1)^{k-1-j-2t}\, 4^t (-\tau)^t$,

with

(29)  $t_j = [(k-1-j)/2]$.

Here [a] denotes the largest integer not exceeding a. In order to show that
$P(\tau)$ has no non negative roots we have to distinguish the following cases.

I. $\ell \le 7$. Direct calculation on a TR 440 in double precision (i.e. 81-84
bit mantissa) showed that all coefficients of $P(\tau)$ are positive whenever k
satisfies (24). Hence $P(\tau) > 0$ for all $\tau \in [0,\infty)$ and hence $D_2$ is satisfied.

II. $\ell > 7$. We split $P(\tau)$ as follows

(30)  $\frac{1}{2} P(\tau) = S(\tau) + T(\tau)$

where

(31)  $S(\tau) = \sum_{j=0}^{k-1} (-1)^{k-1-j} \binom{k}{j} (k-j)^{1-\ell} (1+\tau)^j (\tau-1)^{k-1-j}$

and

(32)  $T(\tau) = \sum_{j=0}^{k-3} (-1)^{k-1-j} \binom{k}{j} (k-j)^{-\ell} (1+\tau)^j \sum_{t=1}^{t_j} \binom{k-j}{2t+1} (\tau-1)^{k-1-j-2t}\, 4^t (-\tau)^t$,

where $t_j$ is given by (29). We shall need the following two lemmata.


LEMMA 2. Let $1 \le k \le 3.3(\ell + 1)$ and $\ell \ge 8$. Then one has

(33)  $S(\tau) \ge k(1+\tau)^{k-2}$ for all $\tau \in [0,\infty)$.

Moreover let $a < a_1$, where $a_1$ is the unique positive root of

(34)  $q(a) = 1$

with

(35)  $q(a) = a\, e^{-1 - 1/a}$.

Then (33) holds for $1 \le k \le a(\ell + 1)$ and $\ell$ sufficiently large. $a_1$ is
approximately 3.59112148.

LEMMA 3. Let $3 \le k \le 1.5(\ell + 1)$ and $\ell \ge 8$. Then one has

(36)  $|T(\tau)| < \frac{1}{4}\, k(k-1)(k-2)^2\, 3^{2-\ell} (1+\tau)^{k-2}$ for all $\tau \in [0,\infty)$.

In order not to interrupt the idea of the proof we postpone the proofs to
Section 4. For k = 1,2 one has from (32) that $T(\tau) \equiv 0$. Hence by Lemma 2 and
(30) one has $P(\tau) > 0$ for all $\tau \in [0,\infty)$. Thus condition $D_2$ holds. In the
following let $k \ge 3$. From (30), (33) and (36) follows

$\frac{1}{2} P(\tau) \ge S(\tau) - |T(\tau)| > (\tau+1)^{k-2} \left( k - \frac{1}{4}\, k(k-1)(k-2)^2\, 3^{2-\ell} \right)$

for all $\tau \in [0,\infty)$, $3 \le k \le 1.5(\ell + 1)$, $\ell \ge 8$. Hence $P(\tau)$ has no non
negative root if

(37)  $3^\ell > \frac{9}{4} (k-1)(k-2)^2$.

Since the right hand side of (37) is monotonically increasing in k it is
enough to show (37) for $k = 1.5(\ell + 1)$. Hence it remains to show that

(38)  $3^\ell > \frac{9}{32} (3\ell + 1)(3\ell - 1)^2$.

However it is easy to see that (38) holds for $\ell \ge 8$.

Collecting the results from I and II we find that $P(\tau) > 0$ for all
$\tau \in [0,\infty)$ for $1 \le k \le 1.5(\ell + 1)$ and $\ell$ = 1,2,... . Hence $D_2$ is
satisfied. □
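The final step of the proof can be checked by direct arithmetic. The sketch below assumes that inequality (38) reads $3^\ell > \frac{9}{32}(3\ell+1)(3\ell-1)^2$ and evaluates both sides exactly for a range of $\ell$. The upper limit 200 is an arbitrary choice; beyond it the exponential left hand side dominates the cubic right hand side even more strongly.

```python
from fractions import Fraction


def inequality_38(l):
    """True if 3^l > (9/32) (3l + 1) (3l - 1)^2, evaluated exactly."""
    return 3 ** l > Fraction(9, 32) * (3 * l + 1) * (3 * l - 1) ** 2


holds_from_8 = all(inequality_38(l) for l in range(8, 201))
fails_at_7 = not inequality_38(7)
```

For $\ell = 8$ the two sides are 6561 and 119025/32 (about 3719.5), while for $\ell = 7$ they are 2187 and 2475, which is consistent with the restriction $\ell \ge 8$ in case II.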


Clearly in the proof of (24) we have made some conservative estimates. We
can improve upon (24) if we consider the methods only for $\ell$ large enough.

THEOREM 6. Let $a < a_1$ where $a_1$ is the unique positive root given by
(34) and (35), $a_1 \approx 3.59112148$. Then Brown's (k,$\ell$)-method is $A_0$-stable and
strongly stable for all k with

(39)  $k \le a(\ell + 1)$

and $\ell$ sufficiently large.

As we shall see from Table 1 computer results suggest that for all $a < a_1$
it is in fact in Theorem 6 enough to take $\ell \ge 2$.

PROOF OF THEOREM 6. As in the proof of Theorem 5 we use Corollary 2. We have
already seen that $D_1$ and $D_3$ are always satisfied. In order to show $D_2$ we split
$P(\tau)$ as in (30) and need estimates for $S(\tau)$ and $T(\tau)$. For $a < a_1$ Lemma 2
gives the lower bound (33) for $\ell$ sufficiently large. The following lemma gives
an upper bound for $T(\tau)$.

LEMMA 4. Let $a \in (1,\infty)$. Then there exists $L_a > 0$ such that

(40)  |T(t)l  < ^ n4  r.4  3'11  ( l + i)k~2  for  r£|0,.| 

k < at  and  t > L 

a 

We  postpone  the  proof  to  section  4.  Let  , < t,,  then  usina  (30),  (33)  and 
(40)  one  has 


j P(t)  > S(T)  - 

for  I 6 |0,*)  , 1 < k < a 
no  non  negative  root  if 

(41) 


I T(  T)  I > (l  + i)k'?  ! k AV'l 

e + 1)  and  f sufficiently  larqe.  Hence  P(:)  has 


However (41) will always hold for $\ell$ sufficiently large. Hence $D_2$ holds for
$\ell$ sufficiently large. □

Note that Theorem 6 gives Dahlquist stability for $a < a_1$, $k \le a(\ell + 1)$
and $\ell$ sufficiently large. The existence of such a result has already been
suspected in [13].

The proofs of Theorem 5 and 6 are based on Corollary 2. Both results are
conservative since the estimates made in Lemma 2, 3 and 4 are crude. This means that
$P(\tau)$ satisfies $D_2$ even if k is somewhat larger than $1.5(\ell + 1)$ respectively
$a_1(\ell + 1)$. But how much larger? To investigate this we have computed numerically,
using a TR 440, the numbers $\nu'_\ell$, $\nu_\ell$, $\sigma_\ell$ and $\mu_\ell$ which are defined as follows.

DEFINITION 1. Let $\nu'_\ell$ be such that the coefficients of $P(\tau)$ are all
positive for $k \le \nu'_\ell$ and at least one is negative for $k = \nu'_\ell + 1$. Let
$\nu_\ell$ be such that $D_2$ holds for $k \le \nu_\ell$ and $D_2$ is violated for $k = \nu_\ell + 1$.
Let $\sigma_\ell$ be such that Brown's (k,$\ell$)-method is $A_0$-stable and strongly stable
for $k \le \sigma_\ell$ but is not for $k = \sigma_\ell + 1$. Let

$\mu_\ell = \min\{ k \mid \rho(\zeta)$ of Brown's (k,$\ell$)-method has a root outside the unit disk $\}$.

Since Brown's methods always satisfy $C_1$, $D_1$, $C_3$, $D_3$ one has
$\nu'_\ell \le \nu_\ell \le \sigma_\ell < \mu_\ell$. Note that one can easily show that $\nu'_\ell \ge 4$
for $\ell \ge 1$. The values for $\nu'_\ell$, $\nu_\ell$, $\sigma_\ell$ and $\mu_\ell$ are given in Table 1.

  $\ell$         1    2    3    4    5    6    7    8    9   10   11
  $\nu'_\ell$    4    6    9   12   16   20   24   29   35
  $\nu_\ell$     6   10   14   18   22   27   31
  $\sigma_\ell$  6   10   14   18
  $\mu_\ell$     7   11   15   19   24   29   34   40   46   53   59

Table 1. Values of $\nu'_\ell$, $\nu_\ell$, $\sigma_\ell$ and $\mu_\ell$ as defined in Definition 1.

Note that in [13] it has been seen for $\ell$ = 1,2,...,11 that Brown's (k,$\ell$)-
method is strongly stable for $1 \le k \le \mu_\ell - 1$. It is surprising that for $\ell \le 4$
one has $\nu_\ell = \mu_\ell - 1$ and hence $\sigma_\ell = \mu_\ell - 1$. The numerical results are
collected in Fig. 1.

4. PROOF OF LEMMATA. We shall need the following

LEMMA 5. Let $n \in \{1,3\}$. Let $a_n$ be the unique positive root of

(42)  $q(a) = \frac{1}{n}$

where q(a) is given by (35). Then to any $a < a_n$ there exists $w_n(a)$ such that

(43)  $n \binom{k}{i-1} \frac{1}{(k-i+1)^\ell} < \binom{k}{i} \frac{1}{(k-i)^\ell}$,  i = 1,2,...,k-1

for $\ell \ge w_n(a)$ and $k \le a(\ell + 1)$. The values for $a_n$ as well as $w_n(a)$ for
some a are given in Table 2.

  n    $a_n$         a     $w_n(a)$
  1    3.59112148    3.3   7
  3    1.65687525    1.5   8

Table 2.


PROOF. (43) is equivalent to

(44)  $f(i) := \frac{\binom{k}{i-1} (k-i)^\ell}{\binom{k}{i} (k-i+1)^\ell} = \frac{i}{k-i+1} \left( \frac{k-i}{k-i+1} \right)^{\ell} < \frac{1}{n}$  for i = 1,2,...,k-1

and we ask for which values of k and $\ell$ (44) is satisfied. We replace the
integer variable i by a continuous variable s and determine the maximum of

$f(s) = \frac{s}{k-s+1} \left( \frac{k-s}{k-s+1} \right)^{\ell}$

for $s \in I = [1,k-1]$. f(s) is a rational function with no poles in I and
f(s) > 0 in I. From

$f'(s) = \frac{(k-s)^{\ell-1}}{(k-s+1)^{\ell+2}} \left( k(k+1) - s(\ell+k+1) \right)$

follows that f(s) has a critical point at

$s' = \frac{k(k+1)}{\ell+k+1}$.

This must be a maximum point since f'(s) is continuous and different from zero
in $(0,k) - \{s'\}$, f'(0) > 0 and $f'(k-\varepsilon) < 0$ for $\varepsilon > 0$, $\varepsilon$ sufficiently small.
Hence (44) is satisfied if $f(s') < 1/n$. Consider

$g_\ell(k) := f(s') = \frac{k}{\ell+1} \left( \frac{k\ell}{(k+1)(\ell+1)} \right)^{\ell}$

which is a monotonically increasing function in k. Hence (44) is established if

$h(a,\ell) := g_\ell(a(\ell+1)) = a \left( \frac{a\ell}{a(\ell+1)+1} \right)^{\ell}$

satisfies

(45)  $h(a,\ell) < \frac{1}{n}$.

Observe that

(46)  $\lim_{\ell \to \infty} h(a,\ell) = a\, e^{-1-1/a} = q(a)$.

Clearly q(a) is monotonically increasing for $a \in (0,\infty)$. Hence (42) has a unique
positive solution $a_n$. Let $a < a_n$, therefore $q(a) < 1/n$. From (46) follows that
there exists $w_n(a)$ such that $h(a,\ell) < 1/n$ for $\ell \ge w_n(a)$ and thus (43) holds.
In order to be able to compute actually $w_n(a)$ given n and a, observe that for
a fixed a the function $h(a,\ell)$ is monotonically decreasing in $\ell$. Thus it is
enough to verify that $h(a,w_n(a)) < 1/n$. □
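The quantities $a_n$ and $w_n(a)$ of Lemma 5 are easy to evaluate numerically. The sketch below assumes $q(a) = a\,e^{-1-1/a}$ and $h(a,\ell) = a\,(a\ell/(a(\ell+1)+1))^\ell$, finds $a_n$ by bisection and $w_n(a)$ by a linear search that relies on the monotonicity of h in $\ell$. The bracketing interval, tolerance and search limit are arbitrary choices, and the function names are hypothetical.

```python
import math


def q(a):
    """Limit function q(a) = a * exp(-1 - 1/a)."""
    return a * math.exp(-1.0 - 1.0 / a)


def h(a, l):
    """h(a, l) = a * (a*l / (a*(l+1) + 1))**l."""
    return a * (a * l / (a * (l + 1) + 1.0)) ** l


def a_n(n, lo=0.1, hi=10.0, tol=1e-12):
    """Root of q(a) = 1/n by bisection; q is increasing on (0, inf)."""
    target = 1.0 / n
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if q(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def w_n(n, a, l_max=1000):
    """Smallest l with h(a, l) < 1/n (h decreases in l for fixed a)."""
    for l in range(1, l_max + 1):
        if h(a, l) < 1.0 / n:
            return l
    raise ValueError("no suitable l found; is a < a_n?")
```

With these definitions `a_n(1)` and `a_n(3)` agree with the values 3.59112148 and 1.65687525 of Table 2 to the digits shown, and `w_n(1, 3.3)` and `w_n(3, 1.5)` give 7 and 8, the values used in the proofs of Lemma 2 and Lemma 3.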


PROOF  OF  LEMMA  2.  Let  us  write 

k-1 


where 

S(t)  = T.  s. 

i=o 

(47) 

si  :=(")  (-ft)1  n-T)"-1  . 

t e [ 0,1  ) . From  (47)  follows  that  s..  > 0 . 

i = 0, 

,1,...  ,k-l 

I.  Let 

Hence 

(48) 

S(t)  > s|(.1  = k( 1+r )k~ 1 S k(l+t)k'?  for 

T 6 I 

0,1  1 

II.  Let 
follows 

t G (1,»)  . From  (43)  with  n a 1 , a = 3.3 
that 

< Oj 

and  Wj(6 

7 


(49) 

Moreover 

(50) 


IsJ  < IS.  I < IS, | < 

0 1 z 


< lsk-l' 


k-l-i 


> 0 . Hence 


S(t)  > s|(.i  + s)(-2  = k(  1+t ) 


k-1 


for  118 

k 


k-2  , k t.„t  + k + k tfc-U ) . 


= (1+t)'  ‘ (-7(2  -k  + 1) 
2l 


2 / 2 
k 

7 


77  (UT)k’2(l-T) 
Clearly

(51)  $2^{\ell} - k + 1 \ge 2^{\ell} - 4(\ell+1) + 1 > 0$  for $\ell \ge 5$

and all $k$ with

(52)  $1 < k \le 4(\ell+1)$.

Hence from (51) and (50) follows

(53)  $S(\tau) > k(1+\tau)^{k-2}$  for $\tau \in [1,\infty)$.

To prove the second part of Lemma 2, we observe that (48) is valid for all $k$ and $\ell$. Moreover from Lemma 5 follows that for $\alpha < \alpha_1$ and $\ell$ sufficiently large (49) holds whenever $k \le \alpha(\ell+1)$. In (50) one has always $2^{\ell} - k + 1 > 0$ for $\ell$ sufficiently large and $k \le \alpha(\ell+1)$. Thus (53) holds too.

PROOF OF LEMMA 3. Since $|\tau - 1| \le |\tau + 1|$ and $|\tau| < |\tau + 1|$ for $\tau \in [0,\infty)$

one finds from (31) that

k'3  /k\  -»  < l / k‘J\  L.I.i.O,  ♦ 


I t ( t ) I < r. 

j=o  \j 


(k-J)  ( 1+t)j 


(r+l)k_1'J'2t  4l( t+1 ) 1 


■ (T+l)k'2  V ( k ) (K-j)'*  I ( k J ) 4t(T  + l)1't  . 

j=0  \ j / t=l  \ 2t+l  / 

For $\tau \in [0,\infty)$ one has that $(\tau+1)^{1-t} \le 1$ for all $t \ge 1$. Hence

4 4 (k'j)^> 

t=l  \2t+l / c t=l  \2t+l  / 


V lwJ 


2s  = | (l+2)k'j 


and  thus 

(54)  |T(t)I  < (r+l)k‘2  \ T.  ( ) (k-jf'  3k‘j  for  Te|0,») 

j=0  \ j ' 

We apply (43) of Lemma 5 with $\eta = 3$, $\alpha = 1.5 < \alpha_3$ and $w_3(\alpha) = 8$. Hence

(55)  I T( T) I < ( r+1  )k-2  ||  J 33_e(k-2)  for  ,e|n,-) 

and  k > 3 . (55)  is  obviously  equivalent  to  (34).  □ 

PROOF  OF  LEMMA  4.  From  (54)  in  the  proof  of  Lemma  3 we  have 


(56) 

k > 1 

1 T ( t ) 1 < i ( 1 + T)k"2  T.  C • . 

i = 3 

and  t > 1 , where 

for  all 

t e in,-) 

(57) 

'“•04 

, i 

= 3,4 k 

Let  a 

1 

€ (1,»)  and  we  introduce 

(58) 

mjk)  :=  max  (ci(,  1 ■ 

i = 3,4,. 

...kl 

Since 

cik<ci,k  + l °"ehas 

(59) 

m,(k)  < mf(k+l)  Tor 

all  k 

and  s 



Let  M be  a bound  for  me(|nt|  ) . Then  one  has 
|T( t) I S \ (HT)k'2(k-2)mf(k) 

(60) 

S j (1+t)  at  M(  for  t € | 0,-)  , k = 3,4 | at  | . 

To simplify the notation we introduce

(61)  $r := [\alpha\ell]$

and

(62)  $c_i := c_{i,r}, \quad i = 3, 4, \ldots, r.$

We say that the sequence $c_i$, $i = 3, 4, \ldots, r$, has a relative maximum of order $q$ at $i = s$ if $3 < s$, $s + q - 1 < r$,

(63a)  $c_{s-1} < c_s = c_{s+1} = \cdots = c_{s+q-1}$

and

(63b)  $c_{s+q-1} > c_{s+q}.$

In order to locate these relative maxima we consider

(64)  $d_i := \frac{c_{i+1}}{c_i} = 3\, \frac{r-i}{i+1} \left( \frac{i}{i+1} \right)^{\ell}, \quad i = 3, 4, \ldots, r-1.$

$c_s$ is a strict relative maximum of order 1 if $d_{s-1} > 1$ and $d_s < 1$. We replace the integer variable in (64) by the continuous variable $x$, thus

(65)  $f(x) := 3\, \frac{r-x}{x+1} \left( \frac{x}{x+1} \right)^{\ell}, \quad x \in [3, r-1].$

Since

(66)  $f(0) = f(r) = 0$, $f(x) > 0$ for $x \in (0,r)$

and

(67)  $f'(x) = 3 \bigl( r\ell - x(\ell+1+r) \bigr) \frac{x^{\ell-1}}{(x+1)^{\ell+2}},$

$f(x)$ has exactly one maximum point in $(0,r)$, namely at

(68)  $\bar{x} = \frac{r\ell}{\ell+1+r}.$

Hence if $f(\bar{x}) \le 1$ then the sequence has no relative maximum and $m_\ell([\alpha\ell]) = c_3$.

If $f(\bar{x}) > 1$ then there are two numbers $x^-$ and $x^+$ such that $0 < x^- < \bar{x} < x^+ < r$ and $f(x^-) = f(x^+) = 1$, $f'(x^-) > 0$, $f'(x^+) < 0$. Assume further that

(69)  $x^+ - x^- > 1$ and $r - x^+ > 1$;

then the sequence $c_i$ has exactly one relative maximum, namely $c_s$, of order $q$, where

(70)  $s = x^+$, $q = 2$ if $x^+ \in \mathbb{N}$;  $s = [x^+] + 1$, $q = 1$ if $x^+ \notin \mathbb{N}$.

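Since everything has to be done for $\ell$ large it is more convenient to normalize the interval $(0,r)$ introducing the new variable

(71)  $t := x/(\alpha\ell).$

Hence, with $g_\ell(t) := f(\alpha\ell t)$ and since $(\alpha\ell - 1)/(\alpha\ell) < r/(\alpha\ell) \le 1$,

(72)  $\lim_{\ell \to \infty} g_\ell(t) = 3\, \frac{1-t}{t}\, e^{-1/(\alpha t)} =: g(t), \quad t \in (0,\infty).$

Let $\bar{t}_\ell = \bar{x}/(\alpha\ell)$. Then

(73)  $\lim_{\ell \to \infty} \bar{t}_\ell = \lim_{\ell \to \infty} \frac{1}{\alpha\ell} \cdot \frac{r\ell}{1+\ell+r} = \frac{1}{1+\alpha} =: \bar{t}.$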

$\bar{t}$ is the unique maximum point of $g(t)$, as can be verified directly from (72). Since

$g(\bar{t}) = g(1/(1+\alpha)) = 3\, \alpha\, e^{-(1+\alpha)/\alpha}$

is a monotonic increasing function in $\alpha$ there is exactly one positive $\alpha$ with

(74)  $g(1/(1+\alpha)) = 1.$

$\alpha = \alpha_3$, as given in Table 2, is the positive root of (74), since (74) is identical to (42) with $\eta = 3$. Hence for any $1 < \alpha < \alpha_3$ there exists a positive $L_\alpha$ such that one has

(75)  $m_\ell([\alpha\ell]) = c_3$  for $\ell \ge L_\alpha$.

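Now let $\alpha > \alpha_3$. Let

(76)  $t^-_\ell = x^-/(\alpha\ell), \quad t^+_\ell = x^+/(\alpha\ell).$

Then the limits of $t^-_\ell$ and $t^+_\ell$ as $\ell$ tends to infinity exist. Let

(77)  $t^- := \lim_{\ell \to \infty} t^-_\ell, \quad t^+ := \lim_{\ell \to \infty} t^+_\ell$

and thus

(78)  $0 < t^- < \bar{t} < t^+ < 1$

and

(79)  $g(t^-) = g(t^+) = 1$, $g'(t^-) > 0$, $g'(t^+) < 0$.

In order to locate $t^+$ observe that $t^+$ depends on $\alpha$. Thus denote it by $t^+(\alpha)$. $t^+(\alpha)$ is given implicitly by (79), that is

(80)  $3\, \frac{1 - t^+(\alpha)}{t^+(\alpha)}\, e^{-1/(\alpha t^+(\alpha))} = 1.$

Implicit differentiation of (80) shows using (78) that

$\frac{d t^+(\alpha)}{d\alpha} > 0$  for $\alpha \in (\alpha_3, \infty)$.

Hence

(81)  $\lim_{\alpha \to \alpha_3^+} t^+(\alpha) < t^+(\alpha) < \lim_{\alpha \to \infty} t^+(\alpha).$

However from (77) and (73) follows that

(82)  $\lim_{\alpha \to \alpha_3^+} t^+(\alpha) \ge \frac{1}{1+\alpha_3} = 0.37638199.$

Since

$\lim_{\alpha \to \infty} g(t) = 3\, \frac{1-t}{t}$  for $t \in (0,\infty)$

we find for $t_\infty := \lim_{\alpha \to \infty} t^+(\alpha)$ the equation

(83)  $3\, \frac{1 - t_\infty}{t_\infty} = 1.$

From (83) we find $t_\infty = 3/4$ and this gives together with (81) and (82)

$\frac{1}{1+\alpha_3} < t^+(\alpha) < 3/4.$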

Using  (77)  we  see  that  for  any  oe  (a,,« 
( 2 L1  one  has  J 


there  exists  l'  such  that  for  all 


and  hence  by  (76) 
(84) 


I 

t+o. 


1 

TfnZ 


< < 3/4 


oe  < x < 


From (78), (77), (83) together with (76) follows that there exists to any $\alpha \in (\alpha_3,\infty)$ a positive $L''_\alpha \ge L'_\alpha$ such that (69) is satisfied for all $\ell \ge L''_\alpha$. From (70) one sees that there exists to any $\alpha \in (\alpha_3,\infty)$ a positive $L'''_\alpha \ge L''_\alpha$ such that any relative maximum $c_s$ of the sequence satisfies

(85)  $0.3\, \alpha\ell < s < 0.8\, \alpha\ell$

for $\ell \ge L'''_\alpha$. For $\alpha = \alpha_3$ it might be that $c_i$ has a relative maximum even though $g(\bar{t}) \le 1$. However from

$\lim_{\ell \to \infty} \bar{t}_\ell = \frac{1}{1+\alpha_3} = 0.376$

follows that there exists $L'''_{\alpha_3}$ such that for $\ell \ge L'''_{\alpha_3}$ a possible relative maximum satisfies (85) with $\alpha$ replaced by $\alpha_3$. Since $x^-$ corresponds to the only possible minimum of $c_i$ one has that for $\alpha \in [\alpha_3,\infty)$ and $\ell \ge L'''_\alpha$

$m_\ell(r) = \max \{ c_3, c_j \mid 0.3\, \alpha\ell < j < 0.8\, \alpha\ell \}.$

We shall show that for $\ell$ large enough one has $m_\ell(r) = c_3$. From (62) and (57) we find

$c_j = \frac{r!}{j!\,(r-j)!} \cdot \frac{3^j}{j^{\ell}} = \frac{\Gamma(r+1)}{\Gamma(j+1)\,\Gamma(r-j+1)} \cdot \frac{3^j}{j^{\ell}}$

where $\Gamma(x)$ is the Gamma function. Hence for $j$ with $j/(\alpha\ell) \in [0.3, 0.8]$ and $\ell$ sufficiently large one has

(86)  $\frac{c_j}{c_3} = \frac{3!\, \Gamma(r-2)}{\Gamma(j+1)\,\Gamma(r-j+1)} \cdot \frac{3^{j-3+\ell}}{j^{\ell}} \le K_1\, \frac{\Gamma(\alpha\ell-2)}{\Gamma(j+1)\,\Gamma(\alpha\ell-j)} \cdot \frac{3^{j+\ell}}{j^{\ell}}$

where $K_1 = 2/9$. Here we used that the Gamma function $\Gamma(x)$ is monotonically increasing for $x > 1$ and that $\alpha\ell - j \ge 0.2\, \alpha\ell > 1$ for $\ell$ sufficiently large.
For all $x > 1$ we can use the crude version of Stirling's formula


(87)  \ (£)  < r(x+l)  < 2 (£) 


(88) 


ci 

t*  K2 


Using  (87)  in  (86)  one  finds 

1 


t - 


6.  (°')' 


where  t = j/at 


and 


t (i  - -4  ) 

1 a C 


K2  = .7=^  e2  Kj 


8 

(1  - -y)  3tn  + ’ 
' ol 


For  al  > max  { 

(89) 

and 

(90) 


Ft 


,6  1 we  have 


tto  + 1(l-t-  i) 
+ 1 


Rp  < P 


. ta  + 1. 1-tv 


j < 32  K, 
C3  2 


(!) 


i 


s-2.6 


« - t n 


01- to 


Since  8 and  a are  independent  of  t it  follows  from  (9o)  and  (89)  that  for 

any  ae  (an,®)  there  exists  l > L'"  such  that 
o ^ no 

-i  < 1 for  t = -i-  e | n.3,0.8  I , t>L 


Hence 


(91) 


me(  I otl  ) = c3  for  ».  > 


However  from  (62),  (61)  and  (57)  one  has 


(92) 


9 

7 


3,3 


= : 


(60) together with (75), (91) and (92) give the desired estimate for all $\ell \ge L_\alpha$, $k \le \alpha\ell$. □
AN EXAMPLE OF APPARENT CONVERGENCE TO THE WRONG LIMIT


Walter  Gautschi 
Mathematics  Research  Center 
University  of  Wisconsin-Madison 
Madison,  Wisconsin  53706 

ABSTRACT. One owes to Perron the continued fraction

$\frac{M'(a,b;z)}{M(a,b;z)} = \cfrac{a}{b - z + \cfrac{(a+1)z}{b + 1 - z + \cfrac{(a+2)z}{b + 2 - z + \cdots}}}$

for the logarithmic derivative of Kummer's function. It is pointed
out here that the continued fraction exhibits a strong phenomenon
of apparent convergence to the wrong limit when |z| is large and Re z > 0.
The phenomenon, in a weakened form, may persist when Re z < 0, but
disappears for z = -x < 0 and 0 < a + 1 ≤ b. The matter is further illustrated
in the special cases of Bessel functions (a = ν + 1/2, b = 2ν + 1) and
incomplete gamma functions (b = a + 1). The complete paper under the
title "Anomalous convergence of a continued fraction for ratios of Kummer
functions" is available as MRC Technical Summary Report #1711, January
1977, and is to appear in Mathematics of Computation, October 1977.




FLOATING  POINT  COMPUTATION  FACILITIES 
FOR A

COMMON  PROGRAMMING  LANGUAGE  FOR  THE  DOD 


David  A.  Fisher 

Science  and  Technology  Division 
Institute  for  Defense  Analyses 
Arlington,  Virginia  22202 


ABSTRACT. This paper discusses some of the considerations and tradeoffs that led to
the  technical  requirements  for  the  floating  point  facilities  for  a common  programming  language 
for  the  DoD.  Some  implications  for  the  requirements  in  designing  and  implementing  a 
language  are  also  given. 

1.  INTRODUCTION.  The  Department  of  Defense  (DoD)  is  attempting  to  limit  the 
number  of  general  purpose  programming  languages  used  in  new  defense  systems.  Currently 
most  data  processing  in  the  DoD  is  programmed  using  COBOL  and  most  scientific  applications 
using  FORTRAN.  There  is  no  intention  to  change  this  situation. 

Most of the costs and problems of software in the DoD, however, are associated with
embedded  computer  systems.  An  embedded  computer  system  is  integral  to  a larger  system  such 
as  an  electromechanical  device,  combat  weapon  system,  tactical  system,  aircraft,  ship,  missile, 
spacecraft,  command  and  control  system,  communication  system,  or  other  system  whose  primary 
function  is  not  computation.  It  also  includes  the  support  software  for  the  design,  development, 
and  maintenance  of  such  systems.  Data  processing,  scientific,  and  research  computers  are  not 
normally  included  among  embedded  computer  systems. 

Currently,  there  are  no  widely  used  languages  for  embedded  computer  systems.  Major 
benefits  to  the  DoD  are  expected  if  the  number  of  languages  used  in  embedded  computer 
applications can be reduced. Programming languages are neither the cause of nor the solution
to  software  problems,  but  because  of  their  central  role  in  all  software  activity,  they  can  either 
aggravate  existing  problems  or  simplify  their  solutions. 

In  determining  an  appropriate  common  programming  language,  the  DoD  has  used  the 
following  three  major  selection  criteria: 

- The language should support unique characteristics of embedded computer systems.
Specifically, software for embedded computer systems must be able to interface with
nonstandard input-output devices, must be able to respond to unplanned error situations,
must be able to meet the real-time constraints for the applications and devices, and must
be able to perform several tasks concurrently.

- The language should support the needs of the software environment for defense systems.
Embedded  computer  systems  are  often  large,  long-lived,  and  subject  to  continuous  change. 
They  must  conform  to  the  physical  and  real-time  constraints  of  the  associated  hardware. 
Software  reliability  and  automatic  recovery  from  failures  are  often  critical  to  defense 
systems. 


- The  language  will  be  suitable  as  a common  language.  Its  object  programs  must  run  on  a 
variety  of  computers,  its  definition  must  be  complete  and  unambiguous,  its  users  must  be 
able  to  specify  their  programs  in  terms  of  the  applications  (rather  than  in  terms  of  their 
object representation), and the language must have a stable definition with enforceable
standards. 


A recent  study  of  23  existing  languages,  including  many  currently  used  in  the  DoD,  has 
shown  that  none  is  appropriate  as  a common  programming  language  for  embedded  computer 
systems.  The  Services  have,  therefore,  undertaken  a joint  effort  to  obtain  a suitable  language 
through  programming  language  design  and/or  modification.  As  a first  step  they  have  agreed 
upon  a set  of  requirements  or  desired  characteristics  for  a common  language.  The 
requirements  reflect  the  major  criteria  at  a more  language-oriented  level  without  specifying 
individual  language  features. 


No language has been chosen as a common DoD language and no determination has
been made for specific facilities of such a language. The technical requirements are a
consensus of the Services. Any examples given below are given as illustrations of
features that satisfy the requirements. They are not necessarily indicative of features that
ultimately will be chosen.


This  paper  reviews  some  of  the  considerations  used  to  determine  the  technical 
requirements for the floating point and fixed point facilities. The requirements reflect the
needs of a common language for embedded computer applications and, therefore, may be
inappropriate  for  general  scientific  and  engineering  environments. 

This paper discusses requirements 3-1A through 3-1D as reported in the "Department of
Defense Requirements for High Order Computer Programming Language - Ironman", High
Order Language Working Group, 14 January 1977. These requirements are concerned with
the floating point facilities and with general characteristics appropriate for both floating point
and fixed point computations. Syntactic issues are not discussed here; any forms used are for
illustrative  purposes  only  and  do  not  indicate  likely  choices  for  a common  language.  The 
appendix  contains  the  requirements  likely  to  affect  the  choice  of  numeric  facilities,  including 
those  for  fixed  point  arithmetic  (i.e.,  3-1E  through  3-1H). 
2. GENERAL REQUIREMENTS


3-1A. Numeric Values. The language shall provide types for integer, fixed point, and
floating point numbers. Numeric operations and assignment that would cause the
most significant digits of numeric values to be truncated (e.g., when overflow occurs)
shall constitute an exception situation.

The common language is intended for a broad class of embedded computer applications
that include sensor processing, real time control, simulation, diagnostics, counting, record
keeping, and display. All these applications require numeric computation facilities in varying
degrees of sophistication. The numeric requirements reflect in greater detail the general
criteria (as expressed in requirements 1A through 1H) for a general-purpose programming
language that will aid the development of reliable, maintainable, and efficient programs; that
will incorporate neither unneeded generality nor unnecessary complexity; that will be practical
in implementation; and that, where possible, will be machine-independent and formally defined.

Particularly  important  among  these  criteria  are  reliability  and  maintainability.  It  is  very 
difficult  to  understand  or  to  predict  the  action  of  programs  if  the  most  significant  digits  of 
numeric values can be inadvertently lost during a computation or assignment to storage. Some
languages represent all numeric values modulo the word length of the object machine. Such a
default is not only machine-dependent, but seldom represents the intent of the programmer.
Consequently,  the  requirements  specify  that  loss  of  the  most  significant  (nonzero)  digits  of  a 
numeric  quantity  must  be  accompanied  by  an  execution  time  exception  situation  and 
subsequent  program  action  specified  in  the  program. 
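The difference between the modulo default and the required exception can be illustrated with a short Python sketch. It assumes a hypothetical 16-bit two's-complement object machine; the names `WORD_BITS`, `wrapping_add`, and `checked_add` are illustrative and do not come from the requirements.

```python
WORD_BITS = 16  # hypothetical object-machine word length (an assumption)


def wrapping_add(a: int, b: int, bits: int = WORD_BITS) -> int:
    """Two's-complement addition modulo 2**bits; high-order digits are lost silently."""
    mask = (1 << bits) - 1
    raw = (a + b) & mask
    # Reinterpret the top bit as the sign bit.
    return raw - (1 << bits) if raw >> (bits - 1) else raw


def checked_add(a: int, b: int, bits: int = WORD_BITS) -> int:
    """Addition that raises an exception instead of truncating significant digits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    result = a + b
    if not lo <= result <= hi:
        raise OverflowError(f"{a} + {b} does not fit in {bits} bits")
    return result
```

With these definitions `wrapping_add(32767, 1)` yields -32768 without any indication of a problem, whereas `checked_add(32767, 1)` raises `OverflowError`, leaving the subsequent action to a handler written in the program.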

3. FLOATING POINT.

3-1C. Floating Point Precision. The precision of each floating point variable and
expression shall be specifiable in programs and shall be determinable at translation
time. Precision specifications shall be required for each floating point variable.
Precision shall be interpreted as the minimum precision to be implemented in the
object machine. Floating point results shall be implicitly rounded (or on some
machines truncated) to the implemented precision. Explicit conversion operations
shall not be required between floating point precisions.

3-1D. Floating Point Implementation. A floating point computation may be
implemented using the actual precision, radix, and exponent range available in the
object machine hardware. There shall be built-in operations to access the actual
precision, radix, and exponent range with which floating point variables and
expressions are implemented.




Floating point is a system for the approximate representation of a wide range of real
numbers. Floating point numbers are most useful for numeric quantities whose values vary
dynamically over a wide range but for which a similar number of significant digits is desired
for all values. Each floating point number can be viewed as a triple of integers: a mantissa, m;
a base, b; and an exponent, e. The value of a floating point number is $m \cdot b^e$. The range
of values that can be represented depends primarily on the exponent range and on the base
chosen. How closely a real number can be approximated depends primarily on the precision of
the floating point representation. By precision is meant the number of digits in the mantissa.
The base is usually implicitly represented in storage and unalterable by the user.

PRECISION. Precision is a primary concern in any system for floating point
computation. The precision used affects the computational results and, therefore, should be
known to the user. Because floating point hardware implementations typically provide more
than one precision, the user should be able to specify the minimum precision desired in a
program.

The  general  requirement  that  the  language  be  machine  independent  may  not  be  fully 
achievable  for  floating  point  computation,  given  the  multiplicity  of  existing  hardware 
implementations.  Nevertheless,  a language  can  be  defined  to  avoid  unnecessary  machine 
dependencies  and  to  minimize  the  effects  of  those  that  are  unavoidable.  Precision 
specifications,  for  example,  can  be  made  machine-independent  by  allowing  the  user  to  give  a 
numerical  specification  of  the  number  of  digits  of  precision  needed. 

Although  numeric  specification  of  precision  can  be  machine  independent,  efficiency  of 
implementation  dictates  that  the  corresponding  implementations  use  the  precisions  available  in 
the object machine hardware. For reliability, the language must require that the implemented
precision  be  at  least  as  great  as  the  specified  precision.  For  efficiency,  translators  should,  in 
each  case,  be  permitted  to  use  any  available  precision  at  least  as  large  as  that  specified. 
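The mapping from a specified precision to an implemented precision can be sketched as follows. The table of available precisions (6 and 15 decimal digits) is a hypothetical object machine, and choosing the smallest adequate format is only one policy a translator might adopt; the text permits any available precision at least as large as the one specified.

```python
# Hypothetical object machine: decimal digits of precision per hardware format.
AVAILABLE_PRECISIONS = {"single": 6, "double": 15}


def implemented_precision(requested_digits: int,
                          available: dict = AVAILABLE_PRECISIONS) -> str:
    """Return the smallest available format with at least the requested digits."""
    candidates = [(digits, name) for name, digits in available.items()
                  if digits >= requested_digits]
    if not candidates:
        # Corresponds to an error detectable at translation time.
        raise ValueError(
            f"no hardware precision provides {requested_digits} digits")
    return min(candidates)[1]
```

For example, a request for 7 digits is implemented in the hypothetical "double" format, while a request for 16 digits cannot be met by this machine and is rejected.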

The  desired  precision  often  varies  within  a computation  and  even  within  an  expression. 
We  might,  for  example,  want  to  compute  the  double-precision  product  of  two  single-precision 
variables.  Consequently,  precision  should  be  specifiable  for  each  variable  and  each  expression 
(including  subexpressions). 

For  efficient  implementation,  the  required  precision  must  be  known  at  the  time  of 
translation.  Because  there  is  no  particular  precision  that  is  needed  most  often  (i.e.,  no 
appropriate  default),  the  language  should  require  user  specification  of  the  precision  of  each 
variable. 

The desired precision for intermediate results (i.e., expressions) will usually be the
maximum of the precisions of their arguments. Thus (in compliance with requirement 1C), the
default precision for expressions can be the maximum argument precision for the expression,
with the default precision explicitly overridden when other precisions are desired. More
precisely, the default for the precision to which an operation is computed and to which
intermediate results are saved should be the maximum of the implemented precisions of the
arguments.


A language as described above would require explicit specification of the precision of
variables, permit explicit specification of the precision of expressions, and have a default
precision for all other cases. In such a language, the (implicitly or explicitly) specified precision
will be known at each stage of a floating-point computation. Thus, although more than one
precision may occur within a computation and, therefore, conversion between precisions will be
required, all precision conversions can be implicitly specified. Conversion between precisions
should be by zero fill or by rounding. Any other interpretation leads to anomalies and
inconsistencies (e.g., equal becomes nontransitive).

The only detectable error situation associated with precision would be that the specified
precision exceeds the maximum precision available in the object machine hardware. Such
errors are detectable during translation. The language should permit either of two actions
when this error occurs. Either the translator does not compile that portion of the program and
generates a translation time error message, or it implements the specified precision using
software routines and generates a translation time warning that the computation may be
exceptionally expensive in execution (see requirements 13F and 13D, respectively).

ROUNDING  RULES.  The  system  proposed  above  is  a tradeoff  between  efficiency  and 
machine independence. It allows the maximum in execution efficiency with what appears to be
machine  independent  specifications.  The  machine  independence,  however,  depends  on  an 
assumption  that  rounding  is  used  throughout  the  floating  point  computation.  If  truncation  or 
other  nonrounding  rules  are  used  in  hardware,  the  significance  of  results  will  be  very  difficult 
to  predict,  and,  without  a detailed  understanding  of  the  object  machine  truncation  rules,  will  be 
impossible. 

The  language  should  not  permit  the  use  of  floating-point  hardware  with  unusual 
rounding  rules  to  implement  the  standard  floating  point  facilities  of  the  language. 
Nonstandard  floating  point  might  be  used  in  the  language  as  a user-defined  type  (see 
requirement 3C) that is defined as a machine-dependent feature (see requirement 11D).

TRANSLATION TIME FUNCTIONS. Even with rounding of floating point results and
explicit numeric specifications of precision, programs will not be completely machine
independent. The results of a floating point computation will depend on the precision, base,
and exponent range used in the implementation. Programs that account for differences in
precision, base, and exponent range can be described in a machine-independent form if the
language provides translation time functions to access the implemented precision, base, and
range.
Precision could be accessed by a function that applies to any variable or expression and
returns the actual precision used to implement that variable or expression. Normally, there is a
single choice of radix for a given machine, so the base could be accessed by a single
parameterless translation time function.

The  exponent  range  is  the  major  factor  in  determining  the  range  of  values  that  can  be 
represented  in  a floating  point  representation.  The  range  of  values,  rather  than  the  exponent 
range  itself,  however,  is  of  greater  use  to  the  user.  The  range  of  values  depends  on  precision 
as well as exponent range. This suggests a need for two translation time functions, each taking
a precision as its argument. A maximum-value function might return the maximum positive
value representable in the implementation of the specified precision, while a minimum-value
function would return the minimum positive value representable in the corresponding
implemented precision. No function is needed for the most negative representable value, since
it will normally only differ in sign from the largest positive value.
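How the range of values follows from radix, precision, and exponent range can be sketched in Python. The sketch assumes the convention of Python's `sys.float_info` (a fractional mantissa in [1/radix, 1) and normalized values only), which differs in scaling from the integer-mantissa view used earlier in this paper; the function names are illustrative. Here the precision is passed directly rather than used to select a format.

```python
import sys
from fractions import Fraction


def max_value(radix: int, precision: int, max_exp: int) -> Fraction:
    """Largest representable value: every mantissa digit at its maximum."""
    return (1 - Fraction(1, radix ** precision)) * Fraction(radix) ** max_exp


def min_value(radix: int, min_exp: int) -> Fraction:
    """Smallest positive normalized value: mantissa 1/radix, smallest exponent."""
    return Fraction(radix) ** (min_exp - 1)


# Evaluate both functions for the floating point format of the host machine.
info = sys.float_info
host_max = max_value(info.radix, info.mant_dig, info.max_exp)
host_min = min_value(info.radix, info.min_exp)
```

On the host machine `float(host_max)` and `float(host_min)` reproduce `sys.float_info.max` and `sys.float_info.min`, which shows that the largest value depends on the precision as well as on the exponent range.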

FLOATING  POINT  OPERATIONS. 

3-1B. Numeric Operations. . . . There shall be built-in operations for addition,
subtraction, multiplication, division with floating point result, and negation for all
numeric types. There shall be built-in equality (i.e., equal and unequal) and ordering
operations (i.e., less than, greater than, less or equal, and greater or equal) between
elements of each numeric type. Numeric values shall be equal if and only if they
represent exactly the same abstract value. The semantics of all built-in numeric
operations shall be included in the language definition. [Note that there might also
be standard library definitions for numeric functions such as exponentiation.]

Three  classes  of  floating  point  operations  are  called  for:  arithmetic  operations,  relational 
operations,  and  library  routines. 

The built-in arithmetic operations will be addition, subtraction, multiplication, division,
and negation. These are the operations normally provided in floating point hardware. For
efficiency of implementation, the arithmetic must be implemented using the available hardware
operations and, therefore, cannot be completely defined in a language that must produce code
for several machines. The language definition can, however, specify properties that must be
met by any correct implementation.

The  language  definition  might  require  that  any  single  floating  point  addition 
approximate  real  addition  within  specified  precision.  It  might  require  that  addition  be 
commutative,  and  that  it  be  nondecreasing  in  positive  arguments.  The  language  can  guarantee 
reasonableness of the operations while excluding few floating-point implementations.




Equal,  unequal,  greater,  less,  greater  or  equal,  and  less  or  equal  floating  point  relational 
operations will be provided. Because of the approximate nature of floating-point arithmetic,
equality  between  floating-point  numbers  will  not  necessarily  imply  equality  among  the 
corresponding  real  numbers.  As  with  the  arithmetic  operations,  the  language  can  dictate 
certain  properties  for  the  relational  operations.  They  should,  except  for  unequal,  be  transitive; 
equal  and  unequal  should  be  commutative;  and  equal  should  be  reflexive.  Notice  that  equal 
between  arguments  of  different  implemented  precisions  will  be  transitive  only  if  the  lesser 
precision  is  expanded  to  the  greater  precision  using  zero  fill  before  the  comparison  takes  place. 
This  is  consistent  with  the  default  precision  of  expressions  mentioned  earlier,  and  implies  that 
floating  numbers  will  be  equal  if,  and  only  if,  their  normalized  representations  in  the  largest 
available  precision  are  identical. 
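The remark on transitivity can be made concrete with a small Python sketch. It assumes that the host machine's single and double formats stand in for the lesser and greater implemented precisions; the helper names are illustrative. Widening a single-precision value to double is exact and plays the role of zero fill, whereas comparing after rounding the greater precision down lets two distinct values both compare equal to a third.

```python
import struct


def to_single(x: float) -> float:
    """Round a double to the nearest single-precision value, returned widened."""
    return struct.unpack("f", struct.pack("f", x))[0]


def equal_zero_fill(lesser: float, greater: float) -> bool:
    """Compare after widening the lesser precision; the widening is exact."""
    return lesser == greater


def equal_by_rounding(lesser: float, greater: float) -> bool:
    """Compare after rounding the greater precision down to the lesser."""
    return lesser == to_single(greater)


a = to_single(0.1)      # a single-precision value
b = 0.1                 # a double-precision value
c = 0.1 + 2.0 ** -40    # a different double that rounds to the same single
```

Under comparison by rounding, `a` equals `b` and `a` equals `c` although `b` and `c` differ, so equality is not transitive. Under zero fill, `a` is unequal to both, and two numbers are equal only if their representations in the greater precision are identical.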

The language is to support library facilities (see requirement 12A) and it is expected that
standard  library  definitions  for  mathematical  functions  will  be  made  available.  Users  of  any 
floating  point  facility  should  realize  that,  in  general,  a subroutine  cannot  produce  results  to  the 
precision  of  the  operations  comprising  its  body.  This  means  that  library  routines  will  generally 
have  to  do  internal  computations  to  a precision  greater  than  that  specified  for  their  results. 
APPENDIX. TECHNICAL REQUIREMENTS AFFECTING
THE  NUMERIC  COMPUTATION  FACILITIES 

(excerpted  from  the  Ironman  document) 


The technical requirements for a common DoD high order programming language are a
synthesis of the requirements submitted by the Military Departments. They specify a set of
language characteristics that are appropriate for embedded computer applications (i.e.,
command and control, communications, avionics, shipboard, test equipment, software
development and maintenance, and support applications). The requirements
reproduced here are those most likely to influence the choice of numeric computation facilities.

General  Syntax 

2B.  Grammar.  The  language  should  have  a simple,  uniform,  and  easily  parsed 
grammar  and  lexical  structure.  The  language  shall  have  free  form  syntax  and  should 
use familiar notations where such use does not conflict with other goals.

2C.  Syntactic  Extensions.  The  user  shall  not  be  able  to  modify  the  source  language 
syntax.  In  particular  the  user  shall  not  be  able  to  modify  or  introduce  new  precedence 
rules  or  to  define  new  syntactic  forms. 

2G. Numeric Literals. There shall be built-in numeric literals. Numeric literals shall
have  the  same  values  in  programs  as  in  data. 

Types 

3A.  Strong  Typing.  The  language  shall  be  strongly  typed.  That  is,  the  type  or  mode 
of  each  variable,  array  and  record  component,  expression,  function,  and  parameter  shall 
be  determinable  at  translation  time. 

3B.  Implicit  Type  Conversions.  There  shall  be  no  implicit  conversions  between  types. 
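
The effect of requirement 3B can be sketched in present-day Python, whose decimal type happens to refuse arithmetic mixed with binary floating point. This only illustrates the principle and is not a facility of the proposed language; the variable names are assumptions made for the example.

```python
from decimal import Decimal

price = Decimal("1.10")   # exact decimal value
factor = 1.5              # binary floating point value

try:
    total = price + factor            # mixed types: rejected, not silently converted
except TypeError:
    total = price + Decimal(factor)   # the conversion is written out explicitly

print(total)  # 2.60
```

The mixed-type addition fails, so the programmer is forced to state which conversion is intended.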

3C. Type Definitions. It shall be possible to define new data types in programs. Type
definitions shall be processed entirely at translation time. The scope of a type definition
shall be determinable at translation time. No restriction shall be imposed on defined
types unless it is imposed on all types.

Numeric Types

3-1A. Numeric Values. The language shall provide types for integer, fixed point, and
floating point numbers. Numeric operations and assignment that would cause the most
significant digits of numeric values to be truncated (e.g., when overflow occurs) shall
constitute an exception situation.

3-1B. Numeric Operations. There shall be built-in operations (i.e., functions) for
conversion between numeric types. There shall be built-in operations for addition,
subtraction, multiplication, division with floating point result, and negation for all
numeric types. There shall be built-in equality (i.e., equal and unequal) and ordering
operations (i.e., less than, greater than, less or equal, and greater or equal) between
elements of each numeric type. Numeric values shall be equal if and only if they
represent exactly the same abstract value. The semantics of all built-in numeric
operations shall be included in the language definition. [Note that there might also be
standard library definitions for numeric functions such as exponentiation.]

Floating  Point  Type 

3-1C. Floating Point Precision. The precision of each floating point variable and
expression shall be specifiable in programs and shall be determinable at translation time.
Precision specifications shall be required for each floating point variable. Precision shall
be interpreted as the minimum precision to be implemented in the object machine.
Floating point results shall be implicitly rounded (or on some machines truncated) to the
implemented precision. Explicit conversion operations shall not be required between
floating point precisions.

3-1D. Floating Point Implementation. A floating point computation may be
implemented using the actual precision, radix, and exponent range available in the object
machine hardware. There shall be built-in operations to access the actual precision,
radix, and exponent range with which floating point variables and expressions are
implemented.
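
Requirements 3-1C and 3-1D together ask for a declared minimum precision plus a way to inquire about what the hardware actually provides. As a rough analogue, present-day Python exposes the implemented characteristics of its float type through sys.float_info. The sketch below reads them and checks a requested minimum against them; the function names and dictionary keys are assumptions made for the example.

```python
import sys

def float_characteristics():
    """Return the radix, precision (in radix digits), and exponent range
    with which the float type is implemented on this machine."""
    info = sys.float_info
    return {
        "radix": info.radix,
        "precision_digits": info.mant_dig,
        "min_exponent": info.min_exp,
        "max_exponent": info.max_exp,
    }

def meets_precision(required_decimal_digits):
    """True if the implemented precision is at least the requested minimum."""
    return sys.float_info.dig >= required_decimal_digits

print(float_characteristics())
print(meets_precision(10))
```

A program written against the minimum precision can use such inquiries to adapt, for example, an iteration tolerance to the precision actually implemented.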

Integer and Fixed Point Types

3-1E. Integer and Fixed Point Numbers. Integer and fixed point numbers shall be
treated as exact numeric values. There shall be no implicit truncation or rounding in
integer and fixed point computations.

3-1F. Integer and Fixed Point Variables. The range of each integer and fixed point
variable must be specified in programs and determinable at translation time. Such
specifications shall be interpreted as the minimum range to be implemented. Explicit
conversion operations shall not be required between numeric ranges.
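
The combined effect of a declared range (3-1F) and the exception on overflow (3-1A) might be sketched as follows in present-day Python. The class names RangedInt and RangeOverflow are hypothetical, and the choice to give a sum the range of its left operand is a simplifying assumption of the sketch, not something the requirements state.

```python
class RangeOverflow(ArithmeticError):
    """Raised when a value falls outside its declared range."""

class RangedInt:
    """An exact integer carrying a declared range low..high."""

    def __init__(self, value, low, high):
        if not low <= value <= high:
            raise RangeOverflow(f"{value} is outside {low}..{high}")
        self.value, self.low, self.high = value, low, high

    def __add__(self, other):
        # The sum itself is exact; the range check happens on construction.
        return RangedInt(self.value + other.value, self.low, self.high)

counter = RangedInt(32760, -32768, 32767)
try:
    counter = counter + RangedInt(10, -32768, 32767)
except RangeOverflow as exc:
    print("exception situation:", exc)
```

The out-of-range sum is reported rather than wrapped or truncated, so the most significant digits are never lost silently.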

3-1G.  Fixed  Point  Scale.  The  scale  or  step  size  (i.e.,  the  minimal  representable 
difference  between  values)  of  each  fixed  point  variable  must  be  specified  in  programs 
and  be  determinable  at  translation  time. 

3-1H. Integer and Fixed Point Operations. There shall be built-in operations for
integer and fixed point division with remainder and for conversion between fixed point
scale factors. The language shall require explicit scale conversion operations whenever
the scale of a value must be changed to properly perform some operation (e.g.,
assignment, comparison, or parameter passing).
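
Requirements 3-1G and 3-1H can be made concrete with a small model of a fixed point value as an exact integer count of steps of a declared size. The sketch is in present-day Python; the class Fixed and its methods rescale and div_rem are hypothetical names, and refusing a rescale that would lose information is a design choice of the sketch.

```python
from fractions import Fraction

class Fixed:
    """Exact fixed point value: an integer number of steps of size `scale`."""

    def __init__(self, steps, scale):
        self.steps = int(steps)
        self.scale = Fraction(scale)

    def value(self):
        return self.steps * self.scale

    def rescale(self, new_scale):
        """Explicit scale conversion; refuses to round or truncate silently."""
        new_scale = Fraction(new_scale)
        steps = self.value() / new_scale
        if steps.denominator != 1:
            raise ValueError("value is not representable at the new scale")
        return Fixed(steps.numerator, new_scale)

    def __add__(self, other):
        if self.scale != other.scale:
            raise TypeError("scales differ; call rescale() explicitly")
        return Fixed(self.steps + other.steps, self.scale)

    def div_rem(self, other):
        """Division with remainder: (integer quotient, Fixed remainder)."""
        if self.scale != other.scale:
            raise TypeError("scales differ; call rescale() explicitly")
        quotient, remainder = divmod(self.steps, other.steps)
        return quotient, Fixed(remainder, self.scale)

a = Fixed(150, Fraction(1, 100))      # 1.50 in steps of 0.01
b = Fixed(5, Fraction(1, 10))         # 0.5 in steps of 0.1
total = a + b.rescale(Fraction(1, 100))
quotient, remainder = a.div_rem(Fixed(40, Fraction(1, 100)))
print(total.value(), quotient, remainder.value())
```

Writing a + b directly raises an error here, which mirrors the requirement that a change of scale be requested explicitly.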

Composite  Types 

3-3A. Composite Type Definitions. It shall be possible to define types that are
Cartesian products of other types. Composite types shall include arrays (i.e., composite
data with indexible components of homogeneous types) and records (i.e., composite data
with labeled components of heterogeneous type).

3-3D. Array Specifications. The number of dimensions for each array must be
specified in programs and shall be determinable at translation time. The range of
subscript values for each dimension must be specified in programs and shall be
determinable by the time of array allocation. The range of subscript values shall be
restricted to a contiguous sequence of integers or to a contiguous sequence from an
enumeration type. [Note that translators may be able to produce more efficient object
programs where subscript ranges are determinable at translation time.]

3-3E.  Operations  on  Subarrays.  There  shall  be  built-in  operations  for  value  access, 
assignment,  and  catenation  of  contiguous  sections  of  one-dimensional  arrays  of  the  same 
component  type. 
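
The three subarray operations named in 3-3E have direct counterparts in the slice notation of present-day Python, shown here only as an illustration of what such operations look like. The variable names are assumptions made for the example.

```python
samples = [10, 20, 30, 40, 50, 60]

# Value access: copy out a contiguous section.
section = samples[1:4]

# Assignment: overwrite a contiguous section in place.
samples[1:4] = [21, 31, 41]

# Catenation: join two contiguous sections of the same component type.
joined = samples[0:2] + samples[4:6]

print(section)  # [20, 30, 40]
print(samples)  # [10, 21, 31, 41, 50, 60]
print(joined)   # [10, 21, 50, 60]
```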

Expressions 

4A.  Form  of  Expressions.  The  form  (i.e.,  context  free  syntax)  of  expressions  shall  not 
depend  on  the  types  of  their  operands  or  on  whether  the  types  of  the  operands  are  built 
into  the  language. 

4B. Type of Expressions. The language shall require that the type of each expression
be determinable at translation time. It shall be possible to specify the type of an
expression explicitly. [Note that the latter requirement provides a way to resolve
ambiguities in the types of literals and to assert the type of results; it does not provide a
mechanism for type conversion.]

4C. Side Effects. The language should permit few side effects in expressions. In
particular, during expression evaluation assignment shall not be allowed to any variable
that is accessible in the scope of the expression.

4D.  Allowed  Usage.  Expressions  of  a given  type  shall  be  allowed  wherever  both 
constants  and  variables  of  the  type  are  allowed. 

4E. Constant Valued Expressions. Constant valued expressions (i.e., expressions
whose values are determinable at translation time) shall be allowed wherever constants of
the type are allowed. Such expressions shall be evaluated before execution time.

4F. Operator Precedence Levels. The precedence levels (i.e., binding strengths) of all
infix operators shall be specified in the language definition, shall not be alterable by the
user, shall be few in number (e.g., three or four), and shall not depend on the types of
the operands. [Note that there might be built-in operator symbols whose meaning is
entirely specified by the user.]

4G. Effect of Parentheses. Explicit parentheses shall dictate the association of
operands with operators. Explicit parentheses shall be required to resolve the
operator-operand associations wherever an expression has a nonassociative operator to
the left of an operator of the same precedence.


Parameters 

7H. Formal Array Parameters. The number of dimensions for formal array
parameters must be specified in programs and shall be determinable at translation time.
Determination of the subscript range for formal array parameters may be delayed until
execution and may vary from call to call. Subscript ranges shall be accessible within
function and procedure bodies without being passed as an explicit argument.

Specifications  of  Object  Representation 

11C. Machine Configuration Constants. The language shall require the declaration of
certain global constants of the object machine configuration. These shall include
constants that specify the machine model, the memory size, special hardware options, the
operating system if present, and peripheral equipment. Such constants shall be used to
determine the object code to be generated by the translator and may also be used by the
program like other constants. [Note that the user can define constants and use them as
switches to control user defined compilation options.]

11D. Configuration Dependent Specifications. It shall be possible to use machine
dependent facilities in programs. Portions of programs that depend on the characteristics
of the object machine (e.g., on the machine model, special hardware options, device
configuration, or operating system) shall be permitted only within branches of conditional
control structures that discriminate on the object machine configuration.
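
The discipline described in 11C and 11D, in which configuration constants are declared once and machine dependent code is confined to branches that test them, can be sketched in present-day Python. The constant names MACHINE_MODEL, OPERATING_SYSTEM, and WORD_BITS and the function path_separator are hypothetical; only the library attributes they read (platform.machine, sys.platform, sys.maxsize) are real.

```python
import platform
import sys

# Configuration constants, declared in one place (hypothetical names).
MACHINE_MODEL = platform.machine()
OPERATING_SYSTEM = sys.platform
WORD_BITS = 64 if sys.maxsize > 2**32 else 32

def path_separator():
    """Machine dependent code appears only inside a branch that
    discriminates on a configuration constant."""
    if OPERATING_SYSTEM.startswith("win"):
        return "\\"
    else:
        return "/"

print(MACHINE_MODEL, OPERATING_SYSTEM, WORD_BITS, path_separator())
```

Because every dependency is guarded by a test on a declared constant, the machine dependent portions of a program are easy to locate when it is moved to another configuration.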